 Research
 Open Access
 Published:
Statistical methods for constructing disease comorbidity networks from longitudinal inpatient data
Applied Network Science volume 3, Article number: 46 (2018)
Abstract
Tools from network science can be utilized to study relations between diseases. Different studies focus on different types of interdisease linkages. One of them is the comorbidity patterns derived from largescale longitudinal data of hospital discharge records. Researchers seek to describe comorbidity relations as a network to characterize pathways of disease progressions and to predict future risks. The first step in such studies is the construction of the network itself, which subsequent analyses rest upon. There are different ways to build such a network. In this paper, we provide an overview of several existing statistical approaches in network science applicable to weighted directed networks. We discuss the differences between the null models that these models assume and their applications. We apply these methods to the inpatient data of approximately one million people, spanning approximately 17 years, pertaining to the Montreal Census Metropolitan Area. We discuss the differences in the structure of the networks built by different methods, and different features of the comorbidity relations that they extract. We also present several example applications of these methods.
Introduction
In the last decade, several network approaches have been introduced to study the interrelations between human diseases. Networks are constructed by connecting diseases that share certain features, collapsing a bipartite graph into a unipartite graph. Examples include genetic/interactomic association (Goh et al. 2007; Halu et al. 2017; Menche et al. 2015), similarity of symptoms (Zhou et al. 2014; Halu et al. 2017), similarity of pertinent drugs (Yıldırım et al. 2007), commonality of etiological environmental factors associated with diseases (Liu et al. 2009), adjacency of metabolic reactions catalyzed by corresponding mutated enzymes (Lee et al. 2008), and cooccurrence in patients (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jensen et al. 2014; Jeong et al. 2017). Also sometimes more than one of these networks are juxtaposed to build a multiplex characterization (Halu et al. 2017). All of these strands of research are beneficial and insightengendering in their respective contexts, and the increase in the breadth of topics and the diversity of approaches promises the emergence of a new field of research.
Here we focus on a methodological problem in this new field. We investigate different statistical methods for defining a weighted and directed comorbidity network from longitudinal hospital inpatient data, and show that different methods capture different aspects of comorbidity relations. We use a data set containing over a million people for a period of approximately 17 years, and employ different statistical methods to extract comorbidity networks based on this data set.
Some of the previous studies have used a binary version of the comorbidity networks to study the structural properties of diseases (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jeong et al. 2017). Measures for establishing unweighted binary links between disease pairs include the ϕcorrelation (which is closely linked to the χ^{2} statistic) and relative risk (ratio of observed cooccurrence of a pair to the expected cooccurrence of a null model) (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jeong et al. 2017). These methods capture useful information about comorbidities, and also have drawbacks. The ϕcorrelation underestimates the associations in disease pairs in which one disease is rare and the other is prevalent. The relative risk tends to overestimate linkages between rare diseases and to underestimate those between prevalent diseases. To use any of these methods, one inevitably chooses tradeoff parameters to construct the network with reasonable accuracy. Examples include the thresholds in Ref. Chmiel et al. (2014), the choice of relative risk cutoff (4 in Ref. Jeong et al. (2017) and 20 in Ref. Folino et al. (2010)), and the choice of defining “lopsided”ness if one direction of a reciprocal link weights at least twice as the other direction (Jeong et al. 2017). These thresholds are chosen to be intuitivelyreasonable values considering the respective settings.
In this paper, we study different systematic statistical methods for building weighted directed comorbidity networks. These methods use different criteria to deem statistical significance for links. The resulting networks are sparser than the raw network, and the links are in some sense adjudicated as meaningful, that is, nonnoise. In addition to statistical considerations, working with sparser networks is easier both computationally and intuitively, and the ultimate goal of gaining insight about paths of disease progression is facilitated. Here we investigate the effect of the statistical procedure used to build a network from the disease cooccurrence data on the structure of the resulting network. We show that depending on the null model used for defining the statistical significance of diseasedisease links, different aspects of the comorbidity patterns are captured, and the resulting networks can have different micro/meso structures, and the centrality/ranking measures of individual diseases can differ. We describe the networks built from each method, discuss their similarities and differences, and present several example applications using these constructed networks.
Data
Using the registry of all medically insured people in the province of Québec (fichier d’inscription des personnes assures  FIPA) we randomly sampled 25% of the people residing in the Montreal Census Metropolitan Area (CMA) in 1998. In each subsequent year, we used the FIPA to resample immigrants to the CMA and babies born to mothers residing in the CMA to maintain a representative, 25% sample for each year. For sampled individuals, we obtain regular data updates from the Régie de l’assurance maladie du Québec (RAMQ) on physician billing, drugs dispensed, hospitalization records, and death certificates. The data sets are linked with an anonymized unique identifier. At any given time, the dynamic cohort contains approximately 1 million people and followup data span approximately 17 years.
Moreover, in one of the applications that we present below, we use the dataset that is publicly available via Ref. Park et al. (2009) to connect our results to previous findings in the literature. In this data set, the protein–protein interaction (PPI) and coexpression networks and the interdisease network of shared genes are linked to the comorbidity network derived from US Medicare claims of over 13 million elderly patients. The data set can be accessed online via http://msb.embopress.org/content/5/1/262.
The analyses reported in this paper has been conducted using MATLAB R2015b.
Network construction methods
ICD codes
We use the ICD9 coding scheme for the classification of diseases. To make the analysis more tractable, we confine the analysis to the 3digit classification.
Network terminology
Throughout, the pathways of disease progression are modeled by a network, where nodes represent diseases and a link from node i to node j represents an instance of diagnosis of disease i followed by a subsequent diagnosis of disease j. We denote the number of connections of a node by its degree, denoted by k. The weight of the link from disease i to disease j is denoted by w_{ij}, which is equal to the number of times a diagnosis of disease i followed by a diagnosis of disease j is reported in the data set. By the strength (Serrano et al. 2009) of a node, denoted by s, we refer to the sum of the weights of its links. We use these for either directions of the links. For example, the ‘outstrength’ \(s_{x}^{\text {out}}={\sum \nolimits }_{y} w_{xy}\) denotes the sum of the weights of the outlinks of node x to other nodes, and the outdegree \(k_{x}^{\text {out}}\) denotes the number of such outlinks. The outstrength of a node is equal to the total number of times that the diagnosis of that disease was followed by the diagnosis of any other disease. The outdegree of a node is the number of distinct diseases that follow that particular disease, without counting the multiplicities. Similarly we can define the instrength and indegree for each node. We denote the sum of the strength of all links by S, that is, we have \(S={\sum \nolimits }_{ij} w_{ij}\).
Raw network
In our data set, there are 1,700,000 distinct hospital visits, and the total number of unique ICD9coded diagnoses is 6,500,000. Among all the hospital visits, 35.3% where given only one ICD9coded diagnosis. Figure 1 presents the histogram of the number of ICD9coded diagnoses per hospital visit. Table 1 presents the top 10 disease in the data set with highest prevalence. Figure 2 depicts the histogram of the prevalence of the diseases in our data set. The distribution of the logarithm of the prevalences is normallike, but the result of the KolmogorovSmirnov test was that the normality assumption is rejected (on the 0.1 level). Though not strictly lognormal, the prevalence distribution is evidently heavytailed, that is, most diseases have low levels of prevalence and a minority of the diseases have extremely high levels of prevalence. The starting point of our analysis is to build a raw network, which will be the substrate on which other methods construct different derived networks.
We seek a weighted and directed characterization of the comorbidity patterns, where the weight of the link from disease i to j equals the number of instances where a patient with disease i later developed disease j. The raw network is made by sweeping over every hospital visit, and incrementing the weight of the link from disease i to disease j if in that visit disease j is diagnosed for the first time in a patient for whom disease i had been previously diagnosed. In other words, if a patient who previously had disease i but did not have disease j visits the hospital and is diagnosed with disease j, then w_{ij} increments by one. There are hospital visits where two diseases are codiagnosed in a patient for the first time. We do not observe what the temporal order of their occurrence was before the patient visited the hospital. The link between the two diseases might be in either direction. We have two possible choices: either to discard this observation or to count this link in both directions. We choose the former, because we prefer less data to morebutnoisy data. About 35% of the patients in the data set only visited the hospital once, thus they did not constitute any comorbidity trajectory with the above criteria, and did not contribute to the raw network. The weight distribution of the links are depicted in Fig. 3. After undertaking the maximum likelihood method devised in Clauset et al. (2009), we conclude that although the distribution resembles a linear curve on the loglog scale, the hypothesis that the weight distribution is powerlaw is rejected. Thus throughout the paper we only use nonparametric statistical methods. We do not assume scalefreeness of the distributions.
Relative risk and observedtoexpected ratio
The relative risk (RR) is a measure of comorbidity strength used in previous studies of disease networks (Hidalgo et al. 2009; Park et al. 2009; Jeong et al. 2017). In these studies, the relative risk is equal to the ratio of the number of times that an ordered pair of diseases occur in the empirical data to the expected number of times it would occur in a random network. In Ref. Hidalgo et al. (2009), the threshold value for this quantity was obtained from formulas for confidence intervals given in Ref. Katz et al. (1978).
Here we make a notational clarification. The problem considered in Ref. Katz et al. (1978) is the problem of finding confidence intervals for risk ratios in the following sense. Consider the 2×2 contingency table given in Table 2. In the following calculations, we use the values of a,b,c,d defined in Table 2 for brevity of notation.
The relative risk considered in Ref. Katz et al. (1978), and conventionally in epidemiology and biostatistics, is defined as \({\left (\frac {a}{a+b}\right)/ \left (\frac {c}{c+d}\right)}\). But that is not how RR is defined in Refs. Hidalgo et al. (2009); Jeong et al. (2017); Park et al. (2009) to study comorbidity links. Rather, these studies define RR as \({\frac {a }{(a+b)(a+c)}}\). The numerator is the observed cooccurrence proportion, and the denominator is the expected cooccurrence proportion under independence. The quantity \({\frac {a}{(a+b)(a+c)}}\) is actually what is often called the observedtoexpected ratio. Herein we denote it by OER. So we use this terminology in our paper:
Note that we can equivalently write:
In this form, logOER is equivalent to the pointwise mutual information that is used, for example, in natural language processing to measure how likely two words are to cooccur (Bouma 2009). The confidence intervals for OER can be obtained by applying the delta method to logOER:
where T denotes transpose. So the 95% confidence intervals are obtained as follows: \( \text {CI}(OER)= \left [\widehat {OER} e^{1.96 \sqrt {\text {var}(\log OER)}}~,~ \widehat {OER} e^{+1.96 \sqrt {\text {var}(\log OER)}} \right ] \).
We re iterate that what we have defined here as OER is what the previous studies of comorbidity networks have referred to as the relative risk, and what we define below as the relative risk is not to be confused with their notation. As discussed above, for the relative risk we define:
So the relative risk is the ratio of the probability that j receives one of the outlinks of i to the probability that j receives a link from another disease that is not i. In other words, the relative risk is the ratio of the probability that disease j succeeds disease i to the probability that disease j succeeds any other disease. For the variance of the relative risk, we can proceed similar to before and obtain the following result which is equivalent to what is used in Ref. Hidalgo et al. (2009) for OER:
Despite the technical distinction between RR and OER, we can show that for practical purposes considered in this paper, these two measures are very close for almost all cases. Dividing Eq. 1 by Eq. 4, we have:
This ratio is very close to one if w_{ij} is much smaller than \(s_{j}^{\text {in}}\), that is, if disease i is not a main predecessor of disease j in comorbidity patterns. However, if the indegree of disease j is small, so that it has few predecessors, then RR might deviate from OER. In our data set, there are only 61 disease pairs for which \(w_{ij}/s_{j}^{\text {in}}\) exceeded 10%. Since this fraction is negligibly small, the RR and OER measures are therefore almost identical. We use the network constructed based on OER in the following analyses to be consistent with the measures used in the previous literature.
A practical caveat of OER is that diseases with very low prevalence can produce unduly large values of OER, which is evident from Eq. 2. This is also pointed out previously in the disease networks literature (Park et al. 2009). A workaround is to discard disease pairs for which the expected cooccurrence under independence is greater than a certain threshold. As investigated in Ref. Park et al. (2009), as long as the threshold exceeds unity, the structure of the OER comorbidity network remains robust against the choice of threshold. In the present paper, we choose the threshold to be equal to unity.
ϕ coefficient
The ϕ correlation coefficient is a measure of association for two binary variables (here, the binary variable indicates whether or not a certain disease is diagnosed). It quantifies the tendency of the two binary variables to cooccur, that is, the concentration of the contingency table towards the diagonal. Generalizing the undirected case considered before in the literature (Hidalgo et al. 2009), we can define the directed version of the ϕ coefficient as follows:
If the instances of disease j succeeding disease i are more frequent than the random case, the ϕ coefficient will be positive. If these instances are less frequent than it would be expected for the random case, the ϕ coefficient will be negative, and it means that having developed disease i actually decreases getting disease j. The caveat in the performance of the ϕ coefficient is that, the higher the disparity between the prevalences of the two diseases, the less informative the ϕ coefficient becomes (Hidalgo et al. 2009). A technical caveat of the ϕ coefficient is that, although it is always in the [−1,+1] interval, the absolute values of its theoretical extrema are less than unity (Guilford 1965; Davenport Jr and ElSanhurry 1991; Park et al. 2009). In the case of positive association, assuming \(s_{j}^{\text {in}}< s_{i}^{\text {out}}\), the maximum value of ϕ which is theoretically attainable is \(\phi _{\text {max}}=\frac {\left (Ss_{i}^{\text {out}}\right) s_{j}^{\text {in}}}{\left (Ss_{j}^{\text {in}}\right)s_{i}^{\text {out}}}\). Because both \(s_{i}^{\text {out}}\) and \(s_{j}^{\text {in}}\) are much smaller than S in our analysis, this maximum value is close to \(s_{j}^{\text {in}}/s_{i}^{\text {out}}\), which is less than unity. If we had started with the assumption \(s_{j}^{\text {in}}>s_{i}^{\text {out}}\), we would have found the inverse of the latter fraction, so the maximum value would again be smaller than unity. A possible workaround would be to normalize ϕ via dividing it by ϕ_{max} (Davenport Jr and ElSanhurry 1991; Park et al. 2009; Chmiel et al. 2014).
The ϕ coefficient is related to the conventional χ^{2} statistic in the following way: ϕ^{2}=χ^{2}/S. To determine statistical significance and to find a pvalue for the ϕ coefficient, Ref. Hidalgo et al. (2009) has employed a tdistribution approximation. The conditions for the validity of the approximation are not discussed. Ref. Chmiel et al. (2014) uses a χ^{2} test. Moreover, it is important to note that the χ^{2} test involves underlying assumptions regarding minimum expected cell counts in the 2×2 table (Everitt 1992). In our setting, characterized by Table 2, the expected cell count for the (1,1) cell is \(s_{i}^{\text {out}}s_{j}^{\text {in}}/S\), which is smaller than unity for many existing disease pairs in our data set. Thus the requirements for the χ^{2} test are not always met. In Ref. Park et al. (2009), the distributions of w_{ij} is assumed to be binomial, and a Poisson approximation is used to calculate the pvalues. It is of note that the Poisson approximation to the binomial distribution also requires the expected cooccurrence of the disease pairs not to be large (usually O(1) is advised in statistics textbooks). For about 10% of the disease pairs in our data set, this condition is not met. So this method is also not applicable to our data set. In this paper, we used Fisher’s exact test to decide the significance of association.
Disparity filter
The disparity filter (DF) has a local nodebased approach to define the network null model (Serrano et al. 2009). We first consider unweighted networks for explanatory purposes. The DF method asks that, for node x with given strength s_{x} and degree k_{x}, what would we expect the weights of its links to look like if they were allocated randomly? In the null model of the DF method, each node is assumed to possess a given strength s_{x} that is to be distributed among its k_{x} neighbors uniformly at random. That is, if the null is true, node x has no preference among its neighbors, and it would distribute its weights uniformly at random. We can mathematically conceptualize this setting as follows: In the unit interval [0,1], k−1 points are drawn uniformly at random, thus leading to k shares of strength pertaining to k different links. Without loss of generality, consider the leftmost interval (between 0 and the leftmost randomlychosen point). The probability that the length of this share (which corresponds to the weight of one of the links) is greater than x is equal to the probability that all of the other k−1 points fall in a piece with length 1−x. This probability equals (1−x)^{k−1}. This coincides with the pvalue for that link, because it gives the probability that, under the null, the share of that link exceeds the observed value.
The said procedure can be undertaken either for outdegrees or indegrees, yielding two distinct backbone networks. We define the pvalue:
Then, a global thresholding can be done for a desired level of significance by discarding links whose α values exceed a certain threshold value. In this paper we set α=0.05 as the threshold value. We only consider the outnetwork in this study, because we are concerned mainly with finding diseases that increase the risk of developing other diseases (that is, those who perform as ‘roots’ in comorbidity paths).
Iterative proportional fitting procedure
The Iterative Proportional Fitting Procedure (IPFP) is a simple method used in the context of US intercounty migration flows (Slater 2009a; 2009b). Here we utilize it to analyze disease flows. This method utilizes the SinkhornKnopp algorithm (Sinkhorn and Knopp 1967), which involves iteratively normalizing the rows and columns of the adjacency matrix until the row and column sums are sufficiently close to unity. In the IPFP method, after constructing this bistochastic matrix via successive normalizations, we start from an empty network and add the links successively in the decreasing order of their weight in the bistochastic matrix until the largest connected component of the network comprises every node. One can also use a global thresholding to obtain sparser networks. As the threshold is lowered, more and more links whose values in the bistochastic matrix exceed the threshold are allowed in. In this paper, we retain the top 5% of the heaviest links of the resultant bistochastic matrix.
The mathematical procedure for the IPFP method is as follows. Suppose we seek B, a transformation of the adjacency matrix A between the N distinct diseases, and we impose the condition that every disease in B must have the same number of preceding diagnoses and the same number of succeeding diagnoses as every other disease. If we interpret disease progressions as flows between diseases, this condition is the equality of inflow and outflow for every disease. This fixed amount is arbitrary and can be set to unity, so the elements of B can be interpreted as probabilities. We can also normalize the elements of A by S, that is, we can interpret w_{ij}/S as the fraction of the total interdisease flux that flows from disease i to disease j, so it can be interpreted as a probability. Consider the following minimumcrossentropy estimation problem:
The task is to minimize the KullbackLeibler divergence between the target matrix B and the adjacency matrix A (whose ij element is w_{ij}), given the normalization constraints of rows and columns of B. We denote the Lagrangian multipliers for the first constraint (normalization of rows) by λ_{j} and those of the second constraint (normalization of columns) by μ_{i}. The Lagrangian is:
Setting the components of the gradient equal to zero, we get:
Let Λ be a diagonal matrix with positive elements \({\exp (\lambda _{i}1/2)/\sqrt {S}}\) and let M be a diagonal matrix with positive elements \({\exp (\mu _{j}1/2)/\sqrt {S}}\). Then, according to Eq. 11, the aboveformulated maximum entropy problem becomes equivalent to finding a bistochastic matrix B such that: B=ΛAM, with the following pair of coupled equations holding: \( \Lambda _{ii} = 1/{\sum \nolimits }_{j} A_{ij} M_{jj}\) and \(M_{ii}= 1/{\sum \nolimits }_{i} A_{ij} \Lambda _{ii}\). This is equivalent to what the SinkhornKnopp theorem states: iterating the matrices Λ and M from the latter pair and inserting the limiting result into the equation B=ΛAM, the unique bistochastic matrix B is obtained (Sinkhorn and Knopp 1967). For discussions regarding convergence of the iteration, see Refs. Sinkhorn and Knopp (1967); Chakrabarty and Khanna (2018). In brief, the method converges if and only if the adjacency matrix has total support, which means that for every nonzero element A, there exists a column permutation of A such that the nonzero element is brought to the main diagonal and every diagonal element is nonzero. Note that adding a nonzero constant to every diagonal element of the adjacency matrix (which is equivalent to adding a selflink for every disease) guarantees this property, because for every offdiagonal element we can simply swap its column such that it is brought to the main diagonal, and the main diagonal of the resulting matrix is already allpositive. Such an addition, akin to Laplace smoothing in machine learning (Schütze et al. 2008), is equivalent to viewing each disease as succeeding itself after a prior diagnosis, because any two checks for the same patient during the period of an illness would produce such a selflink. We performed a robustness check regarding the amount added to the main diagonal. We observed reasonable robustness for any added value up to O(10^{1}) for the network measures that we invoked in this paper. So we used unity; we increment the diagonal of the adjacency matrix by unity and then applied the IPFP procedure by iteratively normalizing columns and rows until sufficient convergence.
The IPFP method has another property that facilitates the interpretation of its function. Consider the mth stage of the iterative normalization procedure in the IPFP method. Denote the adjacency matrix at this stage by A^{(m)}. Denote the sum of row i at this stage by \(r_{i}^{(m)}\) and the sum of column i by \(c_{i}^{(m)}\). Denote the adjacency matrix after row normalization by A^{(m+1/2)}, and denote the result of the subsequent columnnormalization by A^{(m+1)}. The element ij of A^{(m+1/2)} is given by
After the subsequent column normalization, we have
Therefore, for two disease pairs ij and kℓ, we have:
Thus the quantity \( \left [ A_{ij}^{(m)} A_{k\ell }^{(m)}\right ]/ \left [A_{i\ell }^{(m)}A_{kj}^{(m)}\right ]\) is conserved. This can be interpreted as the odds ratio in the contingency table formed by the four diseases i,j,k,ℓ. This odds ratio is the same between the final bistochastic matrix and the original raw adjacency matrix. Thus the IPFP method focuses on relative interdisease flows and discards the absolute link weights.
The GloSS filter
We discussed above that the disparity filter had a local approach; focusing on the distribution of link strengths of individual nodes among their immediate inneighbors or outneighbors. That is, the disparity filter assesses how likely a link strength is to be a nonrandom fluctuation in the links of an individual node. An alternative approach would be to allow the weights to be distributed globally, while still retaining the degrees fixed. This leads to the Global Statistical Significance (GloSS) filter (Radicchi et al. 2011). The GloSS filter assesses how likely a particular link weight is to be a nonrandom fluctuation in the whole network. This method works as follows. We first fix the network topology, that is, the node degrees and directions. This unweighted network is the substrate for the null model. Denote the empiricallyobserved weight distribution of links by \(\hat {p}(w)\). The null model is constructed by assigning to each link of the substrate network a value randomly drawn from the global empirical distribution \(\hat {p}(w)\).
We introduce the auxiliary probability distribution F(s,k), which is the probability that randomly drawing k values from the weight distribution \(\hat {p}(w)\) will yield values that sum up to s. It is straightforward to show that F(s,k) is obtained by convolving \(\hat {p}(w)\) with itself k times. More simply, we can take the inverse Fourier transform of the kth power of the Fourier transform of the original distribution:
Now we can apply Bayes’ rule to obtain:
We can thus compute α_{ij}, which is the probability that, under the null, the value of w_{ij} is greater than a given value. So we arrive at the pvalue for the observed weight w_{ij}:
The method then proceeds by retaining those links whose pvalues are smaller than a threshold value which sets the significance level. In this paper, we set α=0.05.
It is of note that the pvalues of the links are not necessarily highly correlated with link weights. This feature enables a tradeoff between topology and weight. That is, the method has the advantage that it can capture informative links whose weights might not be outstandingly large.
Link salience
An alternative approach for assessing the importance of a link is link salience (Grady et al. 2012). The link salience method takes a more global approach as compared to the previous methods. The underlying rationale of this method can be intuitively described with an analogy to road networks: if the network represents the network of roads between locations, then the link salience method is trying to partition the links into superhighways and roads (Wu et al. 2006). This method involves viewing the network from the point of view of every single node, and measuring how important a given link is viewed by all nodes. The algorithm works as follows. For a path {v_{1},…,v_{ℓ}} between source node v_{1} and target node v_{ℓ}, we define the total effective distance as \({\sum \nolimits }_{i=1}^{\ell } 1/w_{v_{i}v_{i+1}}\). So between any pair of nodes, we can define a shortest path as the path that minimizes the effective distances. For each node v, we find the shortest paths to every other node. This can be done by standard methods, such as Dijkstra’s algorithm. The shortestpath tree rooted at node v can be represented by a binary matrix T(v) with the same size as the original network. So T(v)_{ij} is 1 if the link from node i to node j exists on at least one shortest path from v to some other node, and T(v)_{ij} is zero otherwise. Finally, the salience of the network is defined as the following matrix: \(S=(1/N) {\sum \nolimits }_{v} T(v)\). This matrix gives the link salience values to be used for extracting the network skeleton.
There are several advantageous of this method. First, if the salience of a link is high (that is, close to unity), this means that it participates in most of the shortestpath trees. Thus, viewed from the point of view of the majority of the nodes, this link is important for reaching other nodes. This helps one capture pathways that are critical in reaching certain diseases. For instance, a certain link might not be particularly heavy, yet this method might pick it up because it is essential in reaching a certain disease which is otherwise isolated. Second, this method is highly robust regarding the choice of significance threshold. The distribution of link salience values calculated with the above procedure is bimodal: most links have salience concentrated near zero, a small minority have salience concentrated around unity, and the rest of the links take intermediate salience values. Such a bimodal nature of this distribution considerably facilitates the choice of threshold, because the links that fall in the intermediary region between the two peaks are a negligiblysmall minority and any threshold value in this region will retain almost the same set of links. Third, this method enables characterizing the risks associated with links, rather than the nodes. The global nature of this approach enables extraction of ‘disease highways’, which are the main multidisease progression paths. In this paper, we used the threshold value of 0.9 for the salience values. That is, we retain all the links with salience greater than 0.9.
Comparing networks
In this section we first provide a broad overview of the structure of the constructed networks. We then introduce different network measures for characterizing disease importance and apply these measures to the constructed networks. The measures that we use are instrength, outstrength, eigenvector centrality, PageRank, Hubs and Authority (the HITS algorithm (Kleinberg 1999)), and betweenness centrality.
Overview of the function of different networks
Table 3 presents the summary statistics of the networks built with the above methods. For each network, we calculated the number of links, the percentage of links from the raw network that the method retains, and the number of nodes for each network. To describe the connectivity of nodes, we calculate the mean and standard deviation of the outstrengths and outdegrees, which are mathematically identical to those for the instrengths and indegrees, respectively. To characterize the relation between the degrees of nodes and their neighbors, we calculate the nearestneighbor degree statistics as follows. For each node, we find the average outdegree of its outneighbors. Then we calculate the mean and standard deviation of these values across all nodes. This measure quantifies the nearestneighbor degree correlations of the network. For each network, we denote the average prevalence of the diseases kept by the method by 〈P〉. For each disease, we can find the average prevalence of its neighboring diseases. The average over all these average values is denoted by \(P_{nn}^{\text {out}}\) if we use the outneighbors for calculations, and \(P_{nn}^{\text {in}}\) if we use the inneighbors. Finally, we report a measure of homophily, calculated as the correlation between the prevalence of diseases and the average prevalence of the neighbors of diseases. This assortativity coefficient can be calculated using either the outneighbors or the inneighbors. The resulting coefficients are denoted by \(r_{P}^{\text {out}}\) and \(r_{P}^{\text {in}}\), respectively. These measures can also be visibly investigated from Figs. 5 and 6. Every constructed network is a subsample of the raw network topologically (disregarding the link weights). But there is no particular relation between the constructed networks, that is, one cannot be derived from the other, because the methods employ different filtering rationales. The OER, ϕ, and IPFP methods, assign new weights to the interdisease links, which are different from the raw link weights which simply denote the number of diagnosis successions. In contrast, in the DF, Gloss, and Salience methods, filtering techniques are applied only to decide which links of the raw network to retain and which to discard, and the weights of the retained links remain intact.
From Table 3 we can obtain quick insight into the differences between the function of different methods. For the raw network, the correlations between the prevalence of diseases with their inneighbors and that with their outneighbors, denoted by \(r_{P}^{\text {in}}\) and \(r_{P}^{\text {out}}\) respectively, are both negative. This is also evident from the negative slope of the curves in Figs. 5 and 6 corresponding to the raw network. Also, the prevalence distribution of the raw network, presented in Fig. 4, shows that the distribution of the prevalences has a lognormallike shape, so most diseases have small prevalence, and highprevalence diseases are relatively less common. Combining these two observations, we deduce that the connections are strongly disassortative and mutual. That is, the network has a coreperiphery structure in which a few highlyprevalent nodes preferentially connect to many lowprevalence nodes, and vice versa.
The OER and ϕ methods retain comparable portions of the links, but the OER network discards about 10% of the nodes. We manually verified that all the diseases that the OER method discards are infact extremely lowprevalence. This is also visible in Fig. 4, where the left tail of the OER curve begins around 12, falling in the 6th percentile of the prevalence distribution of the raw network. So we observe that the OER network discards all the diseases in the bottom 5 percentile of prevalence. From Figs. 5 and 6, we also observe that the lowprevalence diseases are discarded by the OER method. We also observed that the curves corresponding to the OER method have negative slopes, similar to the raw network. This indicates that the OER method retains the structural coreperiphery property discussed above. For the ϕ network, correlation between the prevalence of diseases with their inneighbors and that with their outneighbors are close to zero. This means that the disease pairs in the ϕ network neither exhibit homophily nor heterophily in prevalence. This is visible in Figs. 5 and 6, where the curves pertaining to the ϕ method have overall slope close to zero.
The DF network is considerably sparser than OER though it retains roughly the same number of nodes. In contrast to the OER and ϕ networks, the DF network exhibits a high imbalance between \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \). This means that on average, the average prevalence of outneighbors of nodes is five times greater than the average prevalence of the inneighbors of nodes. This is expected by construction, because the way the DF network was built was to retain disproportionatelylarge outlinks of each node, and discard the outlinks with smaller weights. So for a typical disease, heavier outlinks are systematically selected, and these are the links that typically point to highprevalence diseases. In other words, highprevalence diseases are the ones which appear with high frequency among either the inneighbors or the outneighbors of a typical disease, and the DF method systematically discards light outlinks without doing the same to the inlinks, thereby creating an imbalance. Consequently, highprevalence nodes become more likely to appear among the outneighbors as before. This asymmetry is also evident from the mismatch between the signs of \(r_{P}^{\text {in}}\) and \(r_{P}^{\text {out}}\). Moreover, as it is visible in Fig. 4, the DF method discards diseases with low prevalence almost entirely, which is a feature it shares with the OER method.
The IPFP network is sparser than OER, ϕ, and DF networks. The average prevalence of the diseases that IPFP retains is lower than the other three methods. Same is true for the median of the prevalences. This means that the IPFP filter discards disease with extremely high prevalence. This is also evident from the tail behavior of the distribution of log(prevalence) in Fig. 4, where the tail of the IPFP curve visibly plummets below the other curves. The \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \) values of the IPFP network are an order of magnitude smaller than the corresponding values in the OER, ϕ, and DF networks. Thus the coreperiphery feature is nonexistent in the IPFP network, and the prevalences of the disease pairs are on average less unequal than other networks. Moreover, the IPFP method rewards intermediacy. As Fig. 4 shows, the intermediateprevalence diseases constitute a higher fraction of the IPFP method as compared to the raw network.
The GloSS filter has a unique property: the average prevalence is twice as high as the raw network. So the GloSS method has retained disproportionately highprevalence diseases. This is also evident from Fig. 4. The prevalence curve pertaining to the Gloss method is shifted to the far right. In fact, the lowest prevalence in the GloSS network is 1176, which in the raw network, is at the 47 percentile. So the Gloss network discards about half of the diseases, and it suffers from this problem much more severely than OER and DF methods. The values of \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \) are also much greater than the average and median values of P. Moreover, \(r_{P}^{\text {in}}\) and \(r_{P}^{\text {out}}\) are strongly negative. These indicate a strong mutual coreperiphery structure. This feature was also present in the raw network as discussed above, but is markedly accentuated by the GloSS method. The GloSS method discards lowprevalence diseases, and retains medium and highprevalence diseases. As Figs. 5 and 6 illustrate, in the remaining network, the outlinks of highprevalence diseases are preferentially towards mediumprevalence diseases, and conversely, the outlinks of mediumprevalence diseases are preferentially towards highprevalence diseases.
The Salience network is the most sparse, retaining less than 2.5% of all the links in the raw data. The Salience network is small by construction because it seeks to retain a small fraction of links that would capture the macro skeleton of the network. Since according to Table 3 the number of links is smaller than the number of nodes in this network, we deduce that the Salience network is disconnected. Via visualization we verified that the Salience network is in fact segregated to disjoint connected components. The Salience method, like the OER, DF, and GloSS methods, is biased towards highprevalence diseases. However, the degree to which the Salience method suffers from this bias is comparable to the OER and DF methods, and is not as severe as the Gloss method. Another notable uniqueness of the Salience method is the great difference between \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \), with the former being about 60 times greater than the latter. This was the case for the DF network too, but this ratio was about five, which is a much smaller difference than in the Salience network. To investigate if the high skew of the prevalence distribution is disproportionately affecting the averages, we repeated the calculations using the median instead of average, and we observed the same pattern both for DF and salience networks. This indicates a highlyunequal threelevel structure in the Salience network: the Salience network comprises many locallycoreperiphery substructures, where the peripheral nodes preferentially connect to the core nodes, but the core nodes do not reciprocate. Some of the core nodes connect to other core nodes, and some of them do not. That is, there are three types of nodes in the Salience network: (i) core nodes with high instrength (received by many unilateral links from small peripheral nodes), (ii) nodes with high instrength and intermediate outstrength (the inflow comes form many peripheral nodes, the outflow goes into the core nodes from the first category), and (iii) the small peripheral nodes who do not receive any inflow and form unilateral links towards the former two categories. This topology can be simply conceptualized as follows: consider a star graph with many leafs, all unilaterally linking to the central node. Now suppose we take a fraction of the leafs, and for each of them introduce some new leaf nodes that unilaterally connect to them, turning them into miniauthorities. This structure exhibits the correlation properties observed in Table 3 for the Salience network. We also visually verified this hypothesis. Note that the lack of reciprocation from highprevalence nodes is in contrast to the raw and OER networks, which although highly unequal, comprised mostlybidirectional links.
Applicability of different methods
To summarize, the OER and ϕ networks focus on diseasedisease relationships in the usual statistical way, that is, in the absence of network structure. If the task of a study is to investigate the comorbidity between a certain pair of diseases, then these methods are suitable. The OER network disproportionately discards diseases with low prevalences, and the ϕ network disproportionately discards disease pairs with highlyunequal prevalences. The DF network is better if the questions of comorbidity are being formulated conditional on having developed certain diseases first, and if one wants to compare between the risks of different diseases that succeed the given initial diseases. The DF method has the disadvantage that it discards diseases with low prevalence. The IPFP method has an egalitarian approach which controls for disease prevalence. This method prevents the results from being dominated by highprevalence diseases. It asks if all the diseases had the same inflow and the same outflux, what would be the best estimate of the comorbidity matrix, given the information on the empirical matrix? In other words, the IPFP method investigates comorbidity patterns controlled for individual disease prevalences. The Gloss method assesses link weight fluctuations on a global level. So this method is preferred when the task of the study is to compare comorbidity links globally, not focusing on a particular disease. That is, if the emphasis of the study is on links rather than the nodes. The Salience method is only relevant for disease trajectories globally and is not suitable for studying comorbidity statistics. The Salience method is suitable if, for instance, one would like to investigate the expected distance between certain diseases, that is, how many intermediate diseases it would typically take to develop disease B having developed disease A. The OER, ϕ, and DF methods have a local approach. The Gloss filter and the Salience method have global approaches. The IPFP method has a mesoscale approach.
Example applications
In what follows we calculate several conventional measures of node importance. We investigate the agreement between different networks on the importance profile of diseases. We denote the diseases by 3digit ICD9 codes.
Different measures for node importance
Node Strength. For directed weighted networks, we can use the outstrength and instrength to characterize nodal connectivity, as discussed above. Table 4 presents the top 5 diseases with highest instrength, and Table 5 presents the top 5 diseases with highest outstrength.
Eigenvector centrality. A basic measure to characterize the centrality of nodes in a network is the eigenvector centrality (Bonacich 1987). The basic intuition behind this measure is that important nodes are those that are connected to other important nodes. This yields a selfconsistent linear set of equations that yields the centrality scores of nodes. Table 6 presents the top 5 diseases with highest eigenvector centrality for different networks.
The rankings of top nodes show the differences between the networks constructed by the different methods. It is of note that eigenvector centrality tends to capture nodes with high instrength. The OER and IPFP methods predominantly pick up pregnancyrelated diagnoses, whose comorbidity links (preceding or following other diseases or conditions) to several different categories of diseases are wellresearched (Desai et al. 2007; James et al. 2005; Brabin et al. 2001; Kittner et al. 1996). The major difference between pregnancyrelated ICD codes (630679) and other categories is cohesion. As shall be discussed below in Tables 19 and 20, the aggregated instrength and outstrength of this category of nodes is not high as compared to other categories (it ranks among the bottom 5 in both cases), but interestingly, in terms of the weights of withincategory links, this category has an outstandingly large share (it ranks second, after the diseases of the circulatory system). This means that the pregnancyrelated nodes form a cohesive subnetwork. This dense cliquelike structure is comprised of disease with intermediatelevel prevalence with prevalence values all relatively close to one another. The OER coefficient is high for these disease pairs because in addition to high cooccurrence, all diseases have intermediate levels of prevalence, so the overestimation and underestimation tendencies of the OER method are not encountered. Another family of diseases that frequently have high values of eigenvector centrality in all network construction methods is the family of 28X diseases, which are the diseases of the blood and bloodforming organs. Notable diseases in this family include different types of anemia, Haemophilia, and diseases of white blood cells. Iron deficiency anemias (280) and ‘Other and unspecified anemias’ (285) appear consistently higher than Haemophilia. This is mainly caused by the high prevalence of 280 and 285 anemias (with prevalences 40,000 and 107,000, respectively), as compared to Haemophilia, whose prevalence is about 10,000 in our data set.
Table 7 presents the mutual correlation coefficients for the eigenvector centrality of nodes computed for different networks. The IPFP method seems uncorrelated or negatively correlated with the other methods. Most methods are strongly correlated in terms of eigenvector centrality. The Salience method focuses on trajectories and distorts the degrees, so the eigenvector centrality of this method is weakly correlated with that of the other methods. The IPFP method changes the strengths and assigns new link weights after controlling for disease prevalences. The IPFP method has a strong negative association with the GloSS method, because by construction, the GloSS method rewards highprevalence diseases and the IPFP method does the converse.
PageRank. The second centrality measure that we use is the PageRank (Brin and Page 1998). The PageRank algorithm was originally used to characterize the importance of websites. This algorithm basically quantifies the likelihood that a person clicking randomly on links will arrive at a given website (Xing and Ghorbani 2004). This method simulates random walks on the network, with a damping factor that characterizes the probability that the walk terminates at any step and restarts at a node chosen uniformly at random. We set the damping factor equal to 0.85, which is conventional in the literature. Table 8 presents the results for the top 5 diseases with highest PageRank for different networks. An intuitive approximation is that PageRank tends to focus on nodes with high instrength (Fortunato et al. 2006). This is confirmed by comparing Table 8 with Table 4; many of the top nodes are common between these two tables. Comparing Table 8 with Table 6, we observe that IPFP is returning different results—mostly conditions pertaining to the perinatal period. We point out that most of these diseases are not particularly highlyconnected nodes in the raw network. Motivated by this observation, we can gain intuition about how IPFP works by noting that, when normalization of rows or columns is performed at each stage, nodes with large degrees can lose their weight if their neighbors are also of large degree. For example, consider the rownormalization step, where rows (and similarly columns) are normalized to add up to unity. At this step, a row that corresponds to a node with outdegree 100 who is connected with heavy and equal outlinks to its 100 outneighbors will become identical to a row corresponding to a node with outdegree 3 who is connected with light and equal outlinks to its 3 outneighbors. In other words, this method is capturing something that is inherently of a different nature than every other method considered here. Table 9 presents the correlation values for PageRank scores in different networks. Similar to the case of eigenvector centrality, IPFP is negatively correlated with every other method. This observation highlights the difference between IPFP and other methods, and invites more investigation into what this method extracts.
Table 9 is also informative regarding the function of different methods. The PageRank scores of the raw network have a strong positive correlation with those in the ϕ, DF, and GloSS networks. The IPFP method shows a negative correlation, similar to the case of eigenvector centrality, because it punishes diseases with high prevalence and assigns a low strength to them, as discussed before. The Salience method constructs a network comprising entirely of nonmutual links and distorts the instrength and outstrength patterns. The instrength of most nodes are mapped to zero, because, as we discused above, the Salience method comprises mostly of highlyunequal substructures with peripheral nodes unidirectionally connected to core nodes. As mentioned above, the PageRank scores of nodes can be approximated with their instrength (Fortunato et al. 2006). Since the Salience method maps the indegrees of many nodes to zero (retaining only their outlinks to highlyprevalent core nodes, as discussed above), the concomitant distortion in instrengths results in a lower correlation between the PageRank of the Salience method and other networks.
Hubs and authority. An alternative way we could characterize the importance of nodes in terms of inflow and outflow is to employ the HITS algorithm (Kleinberg 1999). The HITS algorithm is a simple and intuitive method that was originally devised to characterize the rankings of websites. This algorithm focuses on simultaneously finding hubs and authorities on the web. In that context, a hub is a website that is influential in directing users towards other highlyranked websites, and an authority is a website which gets directed to by highlyranked nodes. In the context of diseases, a hub would be a disease which, if developed, increases the risks of developing other diseases. An authority would be a disease that follows many other diseases. Table 10 presents the results for the top 5 diseases in terms of hubness in different networks. Comparing the results in Table 10 with those of Table 6, we observe that the hubness scores for all categories correlate highly with those of eigenvector centrality, except DF and Salience. Table 11 presents the correlation matrix for hubness. The hubness score is important in comorbidity studies because the hubs that the HITS algorithm nominates are universal senders, and in the context of comorbidity studies, these would pertain to diseases that substantially increase the risk of many other diseases, demanding more prevention and care. Table 11 shows that there is good agreement between the hub scores of different methods, so despite their structural differences, the hub score is robust and can be reliably used. The two usual exceptions are present here as well: the IPFP method, and the Salience method. These are expected because their tasks are different: the hubness scores of the IPFP method pertain to an alternative inflowoutflow comorbidity matrix where the prevalences are controlled for, and the Salience method only focuses on distances and trajectories rather than actual diseasedisease relations. Table 12 presents the top 5 nodes for authority, and Table 13 presents the correlation matrix for authority scores. For the authority index too, there is good agreement between every method except IPFP and Salience.
Betweenness centrality. Distancebased network measures capture different aspects of how essential a node is in the reachability between other pairs of diseases. Here we use betweenness centrality which characterizes the number of shortest paths between different disease pairs that pass through each given disease. There might be a disease that separates a dense module of diseases from the whole network, such that one would develop the diseases within the module only when one first develops this gatekeeper disease. Or, conversely, after developing a disease within the module, subsequent diseases outside the module occur typically after this gatekeeper disease is developed. Such a disease would have a high betweenness centrality. Another type of nodes that are typically characterized by high betweenness centrality are the core nodes in strong coreperiphery structures, because to go from one peripheral node to another, one has to pass through the core nodes. So this measure is helpful in detecting this structural feature of diseases. Table 14 presents the top 5 nodes with highest betweenness centrality in the constructed networks. In the raw, OER, ϕ, DF, and Gloss networks, the top nodes are those with extremely high prevalence. The abovediscussed coreness underpins the high betweenness centrality of these nodes. The Salience method assigns high betweenness centrality scores to conditions pertaining to the perinatal period, which is potentially related to the first typical case of high betweenness mentioned above. That is, certain prenatal conditions act as gatekeeping conditions between conditions before birth and conditions after birth. The results of the IPFP method are less intuitive, consistent with the results for the previous measures. Table 15 presents the correlations across different networks.
Example application: the role of disease prevalence
To get a better intuition into the network measures that we used to characterize diseases, we investigate the correlation between each network measure and disease prevalence in every constructed network. The results are presented in Table 16. In the raw network, every measure has a strong positive association with disease prevalence. Thus highlyprevalent diseases such as diabetes and hypertension receive a high score no matter the network measure used to characterize the diseases. Same is true for the DF and GloSS networks. In contrast, for the IPFP network, prevalence is negatively correlated with every network measure except hub and authority, and for the latter two the association is close to zero. For the Salience network, the correlation between the hub index and prevalence is negative. Together with the positivity of the correlation between the authority index and prevalence, this indicates that in the Salience network, highprevalence nodes disproportionately receive inflows and do not reciprocate. This is in contrast with every other method, where both of these correlations are positive, meaning that highprevalence nodes are characterized by both large inflows and outflows.
Note that the exact value of prevalence for each disease cannot be recovered from the networks alone. This is due to the existence of disease progression trajectories with length greater than two. If every patient had a registry of form A →B, that is, only two diseases, then the prevalence of each disease would be simply the sum of the outstrength and instrength of its corresponding node in the raw network. But because many instances of higherorder records such as A →B→C exist, and those with greater lengths, the prevalence information is lost. If we conceptualize the weighted links in the raw network as distinct links with unit weight, then in this picture, each link would represent exactly one patient if every disease trajectory had length one (that is, in the form A →B). But due to the presence of higherorder trajectories, more than one link can together pertain to a single patient, thus the prevalence information lost. However, Table 16 shows a strong correlation between the outdegree of diseases (about 0.95) and their prevalence. So if we did not have the prevalence data, we could use outdegree as a proxy for prevalence. It would be interesting to investigate if this correlation pattern between various network measures and prevalence would be replicated using data from other regions of the globe.
Example application: shared genes and proteinprotein Interactions
We can use the results to investigate the relation to previous studies on disease networks, which in addition to comorbidity observations, incorporate other diseasedisease linkages into the analysis. For example, in Ref. Park et al. (2009), the protein–protein interaction (PPI) and coexpression networks and the interdisease network of shared genes are compared to the comorbidity records from US Medicare claims. Many observed comorbidity patterns are therefore linked to the shared PPIs and the shared genes of the diseases. For example, the significant comorbidity between Alzheimer’s disease and myocardial infarction is linked to their shared ACE and APOE genes. As another example, a PPI between the genes associated with the autonomic nervous system disorder and the carpal tunnel syndrome (IKBKAP and TTR, respectively) is suggested to contribute to the statisticallysignificant comobrnidity between these two diseases. We can investigate how the results of Ref. Park et al. (2009) are reflected in our constructed networks. Because of the linkfocused nature of these results, we expect the local methods to be relevant here, which include the OER measure, the ϕ coefficient, and the DF method. In our data set, there are 747 cases where the Alzheimer’s disease is diagnosed following myocardial infarction. All three methods deem this link as significant. There were 506 cases where, conversely, Alzheimer’s disease preceded. Only the DF method deemed this direction of the link as significant, which indicates that myocardial infarction attracts a significant share of the outstrength of the Alzheimer’s disease. This implies that conditional on having developed the Alzheimer’s disease, there is an elevated chance of later developing myocardial infarction. There were 38 cases were autonomic nervous system disorder is diagnosed prior to the carpal tunnel syndrome, and OER and ϕ deem this link as significant. There are 47 cases were the carpal tunnel syndrome precedes, and neither of the methods deem this direction of the link as significant. This highlights the strength of the directed characterization of the network over the undirected versions considered in the literature, because in addition to association between disease pairs, the distinction between the statistical properties of the two directions sheds light on which of the two diseases is more probable to cause the other, or at least to precede the other in the causal network that subsumes them both besides other covariates.
Table 17 pertains to disease pairs that have OER>1.5 and are related via shared PPI or genes as deemed by Ref. Park et al. (2009). The table presents the percentage of such disease pairs that are deemed significant by different constructed networks. As expected, OER and ϕ have the best performance, and DF is also performing well. These three methods have a local focus, therefore, they are potent in detecting such linkbased relations. The other methods, however, focus more on the global structure of the network, and as Table 17 shows, have poor performance for detecting such disease pairs, which matches the expectation.
Example application: negative comorbidity and protective effects
As mentioned above, the ϕ coefficient and the OER measure can capture negative comorbidities, that is, cases where developing disease A is negatively associated with developing disease B. We can use the ϕ correlation to detect strong negative comorbidities, minding that distinguishing actual protection effects from mere negative associations obviously requires more rigorous causal analysis and is not within the scope of this paper. Throughout this paper, we only investigate associations. Here we consider several existing examples in the literature where such negative association has been suggested, and to check if our data replicates these findings. A famous example is the negative association between Alzheimer’s disease and various types of cancer (see Ref. Ma et al. (2014); TabarésSeisdedos et al. (2011) and references therein). We focus on the ICD9 codes 140 to 239, which pertain to the neoplasms category, and investigate how the ϕ and OER methods perform at capturing negative comorbidities. For the OER method, we look for significant links with OER < 1, and for the ϕ method, we look for significant links with ϕ<0. The ϕ method gives 29 distinct neoplasms exhibiting significant negative comorbidity with Alzheimer’s disease. The OER method detects 26 of those 29 links, and finds no links that the ϕ method had not. From these two sets, 27 neoplasms are deemed as significant by both methods. In the converse direction, where the initial diagnosis of a neoplasm is associated with reduced risk of subsequent Alzheimer’s disease, the ϕ method detects 30 neoplasms, and the OER detects a subset of them comprising 28 neoplasms. For the Alzheimer’sneoplasm direction of negative comorbidities, the strongest negative association that the ϕ method returns pertains to secondary/unspecified malignant neoplasm of lymph nodes, and for the OER method, the strongest negative association pertains to benign neoplasm of kidney. In the neoplasmAlzheimer’s direction, the first rank for the ϕ method is secondary malignant neoplasm of respiratory and digestive systems, and for the OER method the first rank belongs to malignant neoplasm of gallbladder and extrahepatic bile ducts.
Example application: pregnancyrelated codes
As discussed above, pregnancyrelated codes are among the highlycomorbid codes in several constructed networks. We can use the different constructed networks to investigate these comorbidities. Most codes in the 630679 range are highly connected to one another. Here we focus on comorbidity with codes outside this category. We consider ICD9 code 634 (spontaneous abortion) as an example. In the OER network, the diseases with the highest correlations that tend to precede 634 include 792 (nonspecific abnormal findings in other body substances), 282 (hereditary hemolytic anemias) and 218 (uterine leiomyoma). These links are also deemed significant by the ϕ method. These are in agreement with previous results in the literature (Serjeant et al. 2004; Pajor et al. 1993; Powars et al. 1986; Coronado et al. 2000; Klatsky et al. 2008). The only link OER, ϕ, DF, and IPFP methods all agree on is 792. Figure 7 shows the local graph of pregnancy related codes. As stated above, pregnancyrelated codes are highly cohesive and each of them connects to many others. For better visualization, we sparsified the neighborhood networks as follows. For each node, we only retained the two pregnancyrelated neighbors with highest weights and the two nonpregnancy codes with highest weights. In Fig. 7, the blue links are between pregnancy codes and the orange links are between pregnancy and nonpregnancy codes. We used the ϕ network. to generate this graph. Highlyconnected nodes in this subnetwork are iron deficiency anemia and other types of anemias (Serjeant et al. 2004; Pajor et al. 1993; Powars et al. 1986) and Cholelithiasis (Ács et al. 2009; Dixon et al. 1987).
Example application: insight from coarsegrained networks
We now construct a coarsegrained picture of the disease network based on the standard 17 categories of the ICD9 coding scheme (World Health Organization 2004). The list of the 17 categories are presented in Table 18, along with the number of 3digit ICD9 codes contained within each category, percentage of 3digit ICD9 codes contained within each category, number of diagnoses in the data set that pertain to diseases within each category, and the percentage of such diagnoses.
The network properties of the 17node coarsegrained networks are summarized in Table 19, which presents the percentage of the total link weight that flows into and out of each category. Table 20 presents the percentages of the total link weight of the network that falls within each disease category, that is, pertaining to links that connect two nodes that both belong to the same disease category. In the raw network, the highest selfflow belongs to category 7 (disease of the circulatory system), with almost 4% of the total link weight contained inside it. In every method except OER and IPFP, a comparatively high fraction of the total link weight of the network flows within category 7. This category includes, most notably, hypertension, cardiovascular disease, and Ischemic heart disease.
The second highest selfflow in the raw network belongs to category 11 (complications of pregnancy, childbirth, and the puerperium). The OER network changes some of the selfflows markedly. In the OER network, category 11 has an outstandingly high selfflow. Its selfflow is almost the same as the selfflow of every other category combined. So the OER method assigns a high value to disease pairs that both belong to this category. The IPFP method also assigns an outstandingly high selfflow to this category. So, controlling for prevalence, diagnosis pairs that both belong to category 11 typically receive a high weight in the IPFP network. The Salience method assigns the most outstanding selfflow to category 11. In the Salience network, selfflow of category 11 exceeds the selfflows of every other category combined. This means that these diseases are highly central in the diseasedisease trajectories, and the links between these diseases contribute to many shortest paths between every disease pair in the whole network.
In Table 21, we present the rankings of the disease categories in terms of what fraction of the total link weight of the network flows out of each category. Except for the IPFP network, the highest outstrength of all networks either belongs to category 11 (complications of pregnancy, childbirth, and the puerperium) or category 7 (disease of the circulatory system). Moreover, category 3 (endocrine/nutritional/metabolic/immunity disorders) is consistently high in outstrength across different networks.
Example application: comorbidity with the neoplasm category
Many studies in the literature have demonstrated comorbidity patterns between different neoplasms and various other diseases (Mazza and Mitchell 2017; van Baal et al. 2011; Piccirillo et al. 2004; Piccirillo 2000; Fleming et al. 1999; West et al. 1996; Yancik et al. 1996). We can use the coarsegrained networks to investigate neoplasmrelated comorbidity patterns. Since we are conditioning on neoplasm being the first diagnosed disease in comorbidity pairs, we can use the DF network. In the coarsegrained DF network, the outlinks with the highest weights emanated from the neoplasm category are those to node 7 (diseases of the circulatory system, with 21% of the total link weight of the network), and node 3 (endocrine/nutritional/metabolic/immunity disorders, with 10% of the total link weight). Same is true for the OER and ϕ networks; the top two disease categories that follow neoplasms are 7 and 3.
Conclusion and future work
In this paper we provided a brief summary of some of the main existing methods in the network science literature that could be utilized to construct disease comorbidity networks from longitudinal hospital data. We showed that these methods capture different aspects of the comorbidity patterns, and one must note their properties and choose them according to the structural feature of interest. We presented several examples of the applications of these methods in studying different comorbidity relations, both for single diseases and disease groups. Methodological work in this domain is inchoate, similar to the field of network medicine itself. So there are many interesting unexplored problems of practical significance. Below we highlight a few of such problems.
As discussed above, there are many cases in which a patient visits the hospital and multiple diseases are diagnosed and stored in the data set for the same visit. The temporal direction of such links is lost. We chose to discard such links and refrained from introducing noise to the data set by counting them as bidirectional. An interesting problem would be to infer the direction of these undirected links using edge recovery algorithms (Martin et al. 2016) (not to be confused with link prediction, where the task is to predict the existence of empiricallyabsent links). which can also be done as a byproduct of communitydetection algorithms (Martin et al. 2016). One has to first devise an inference method which is applicable to weighted directed graphs.
In the construction of the networks discussed in the text, we did not use the covariates, such as age and gender. The interplay between such covariates and structure is worth serious investigation. Manual investigation, such as separating the data for different sexes (such as in Ref. Chmiel et al. (2014); Jensen et al. (2014)) does provide insight, yet a systematic and algorithmic approach would be an interesting research problem. For example, it would be interesting to formulate the macro/meso properties of the comorbidity network, and the local properties of individual diseases, as a function of age. This would be an example of link metadata: a disease would be connected to another disease via multiple links whose metadata (age/sex) are different. In this case, unlike what we did in the present paper, one must not aggregate all the link weights into a single weight. Though methods that incorporate nodal metadata exists (for example, for community detection (Newman and Clauset 2016; Peel et al. 2017)), we are not aware of a systematic investigation for linkbased metadata. A simpler approach for the age variable would be to divide the age variables into discrete categories to construct a multiplex network, where each layer represents an age group. Then methods for multiplex network analysis can be applied (e.g., those of community detection (De Bacco et al. 2017)).
Perhaps more important would be to investigate multimorbidity patterns. This would be the first step towards controlling for age. That is, if instead of comorbidity pairs, we limit the analysis to sets of, say, four diseases, the analysis would differ. We would look for the number of instances that the ordered set of diseases A−B−C−D have been diagnosed in the dataset. This probably evinces a more meaningful connection between these diseases, as opposed to only considering pairs separately. In other words, if there are many instances of A−B−C−D, this is probably more informative regarding the linkages of these diseases as compared to observing many A−B and B−C and C−D pairs separately. Considering longer chains in such a manner is more likely to capture actual disease progression pathways, because it reduces the likelihood that, for example, A simply happens to be a disease that disproportionately occurs in early ages, and D at later ages.
Another interesting direction forward would be to incorporate death. One could add death as a new disease. We would obviously have \(s_{\text {death}}^{\text {out}}=0\), but it would be instructive to study the inlinks and the position of death in the disease network. More particularly, different diseases can be characterized as their distance to this reference node. The first step towards incorporating death into the disease network would be to check the relative network distance of certain diseases, or disease categories, to the death node, which characterizes how deadly those diseases or disease categories are. This also enables focusing the analyses on diseases that tend to appear late, and to characterize their relation to death, and to investigate if they have special properties in this regard.
In addition to death, a more complete analyses would require the addition of a ‘noise’ node, which would represent unknown causes. Because currently, when a disease is diagnosed without any predecessor, it receives no inlink. This limits the analytical power to investigate the importance of unknown causes, which characterizes the likelihood that a healthy person would enter the disease comorbidity network. In other words, the inflow of the network (which represent new patients) all enter through this root node, whence they flow throughout the rest of the network.
Finally, we remark upon the importance of distinguishing between the structural features of acute and chronic diseases in the comorbidity patterns. The temporal profile of comorbidity of disease A, if A is a chronic disease, looks like: {A}→{A,B}→{A,B,C}→…, because chronic diseases tend to persist, by definition. An alternative pattern could be {A}→{A,B}→{A,C}→…. On the contrary, if A is an acute disease, then the temporal pattern of comorbidities would look like {A}→{B}→{C}→…. These patters presumably undergird different mechanisms of comorbidity, because there is a difference between having a certain disease and having a history of it. Thus a worthwhile problem to study would be the algorithmic characterization of the chronicness of diseases and the persistence of their damages, based on their structural properties in the comorbidity network.
Abbreviations
 CMA:

Census metropolitan area
 DF:

Disparity filter
 GloSS:

Global statistical significance
 ICD:

International classification of diseases
 IPFP:

Iterative proportional fitting procedure
 OER:

Observedtoexpected ratio
 PPI:

Proteinprotein interaction
 RR:

Relative risk
References
Ács, N, Bánhidy F, Puhó EH, Czeizel AE (2009) Possible association between symptomatic cholelithiasiscomplicated cholecystitis in pregnant women and congenital abnormalities in their offspring—a populationbased case–control study. Eur J Obstet Gynecol Reprod Biol 146(2):152–155.
Bonacich, P (1987) Power and centrality: A family of measures. Am J Sociol 92(5):1170–1182.
Bouma, G (2009) Normalized (pointwise) mutual information in collocation extraction In: Proceedings of Biannial GSCL Conference, 31–40, Postdam.
Brabin, BJ, Hakimi M, Pelletier D (2001) An analysis of anemia and pregnancyrelated maternal mortality. J Nutr 131(2):604–615.
Brin, S, Page L (1998) The anatomy of a largescale hypertextual web search engine. Comput Netw ISDN Syst 30(1–7):107–117.
Chakrabarty, D, Khanna S (2018) Better and simpler error analysis of the sinkhornknopp algorithm for matrix scaling. arXiv preprint arXiv:1801.02790.
Chmiel, A, Klimek P, Thurner S (2014) Spreading of diseases through comorbidity networks across life and gender. New J Phys 16(11):115013.
Clauset, A, Shalizi CR, Newman ME (2009) Powerlaw distributions in empirical data. SIAM Rev 51(4):661–703.
Coronado, GD, Marshall LM, Schwartz SM (2000) Complications in pregnancy, labor, and delivery with uterine leiomyomas: a populationbased study. Obstet Gynecol 95(5):764–769.
Davenport Jr, EC, ElSanhurry NA (1991) Phi/phimax: review and synthesis. Educ Psychol Meas 51(4):821–828.
De Bacco, C, Power EA, Larremore DB, Moore C (2017) Community detection, link prediction, and layer interdependence in multilayer networks. Phys Rev E 95(4):042317.
Desai, M, Ter Kuile FO, Nosten F, McGready R, Asamoa K, Brabin B, Newman RD (2007) Epidemiology and burden of malaria in pregnancy. Lancet Infect Dis 7(2):93–104.
Dixon, NP, Faddis DM, Silberman H (1987) Aggressive management of cholecystitis during pregnancy. Am J Surg 154(3):292–294.
Everitt, BS (1992) The Analysis of Contingency Tables. Chapman and Hall/CRC, London.
Fleming, ST, Rastogi A, Dmitrienko A, Johnson KD (1999) A comprehensive prognostic index to predict survival based on multiple comorbidities: a focus on breast cancer. Med Care 37(6):601–614.
Folino, F, Pizzuti C, Ventura M (2010) A comorbidity network approach to predict disease risk In: Information Technology in Bioand Medical Informatics, ITBAM 2010, 102–109.. Springer, Heidelberg.
Fortunato, S, Boguñá M, Flammini A, Menczer F (2006) Approximating pagerank from indegree In: International Workshop on Algorithms and Models for the WebGraph, 59–71.. Springer.
Goh, KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL (2007) The human disease network. Proc Natl Acad Sci 104(21):8685–8690.
Grady, D, Thiemann C, Brockmann D (2012) Robust classification of salient links in complex networks. Nat Commun 3:864.
Guilford, JP (1965) The minimal phi coefficient and the maximal phi. Educ Psychol Meas 25(1):3–8.
Halu, A, De Domenico M, Arenas A, Sharma A (2017) The multiplex network of human diseases. bioRxiv:100370.
Hidalgo, CA, Blumm N, Barabási AL, Christakis NA (2009) A dynamic network approach for the study of human phenotypes. PLoS Comput Biol 5(4):1000353.
James, AH, Bushnell CD, Jamison MG, Myers ER (2005) Incidence and risk factors for stroke in pregnancy and the puerperium. Obstet Gynecol 106(3):509–516.
Jensen, AB, Moseley PL, Oprea TI, Ellesøe SG, Eriksson R, Schmock H, Jensen PB, Jensen LJ, Brunak S (2014) Temporal disease trajectories condensed from populationwide registry data covering 6.2 million patients. Nat Commun 5:4022.
Jeong, E, Ko K, Oh S, Han HW (2017) Networkbased analysis of diagnosis progression patterns using claims data. Sci Rep 7(1):15561.
Katz, D, Baptista J, Azen S, Pike M (1978) Obtaining confidence intervals for the risk ratio in cohort studies. Biometrics 34(3):469–474.
Kittner, SJ, Stern BJ, Feeser BR, Hebel JR, Nagey DA, Buchholz DW, Earley CJ, Johnson CJ, Macko RF, Sloan MA, et al. (1996) Pregnancy and the risk of stroke. N Engl J Med 335(11):768–774.
Klatsky, PC, Tran ND, Caughey AB, Fujimoto VY (2008) Fibroids and reproductive outcomes: a systematic literature review from conception to delivery. Am J Obstet Gynecol 198(4):357–366.
Kleinberg, JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632.
Lee, DS, Park J, Kay K, Christakis N, Oltvai Z, Barabási AL (2008) The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci 105(29):9880–9885.
Liu, YI, Wise PH, Butte AJ (2009) The“ etiome”: identification and clustering of human disease etiological factors. BMC Bioinformatics 10(2):14.
Ma, LL, Yu JT, Wang HF, Meng XF, Tan CC, Wang C, Tan L (2014) Association between cancer and alzheimer’s disease: systematic review and metaanalysis. J Alzheimers Dis 42(2):565–573.
Martin, T, Ball B, Newman ME (2016) Structural inference for uncertain networks. Phys Rev E 93(1):012306.
Mazza, D, Mitchell G (2017) Cancer, ageing, multimorbidity and primary care. Eur J Cancer Care 26(3):12717.
Menche, J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási AL (2015) Uncovering diseasedisease relationships through the incomplete interactome. Science 347(6224):1257601.
Newman, ME, Clauset A (2016) Structure and inference in annotated networks. Nat Commun 7:11863.
Pajor, A, Lehoczky D, Szakacs Z (1993) Pregnancy and hereditary spherocytosis. Arch Gynecol Obstet 253(1):37–42.
Park, J, Lee DS, Christakis NA, Barabási AL (2009) The impact of cellular networks on disease comorbidity. Mol Syst Biol 5(1):262.
Peel, L, Larremore DB, Clauset A (2017) The ground truth about metadata and community detection in networks. Sci Adv 3(5):1602548.
Piccirillo, JF (2000) Importance of comorbidity in head and neck cancer. Laryngoscope 110(4):593–602.
Piccirillo, JF, Tierney RM, Costas I, Grove L, Spitznagel Jr EL (2004) Prognostic importance of comorbidity in a hospitalbased cancer registry. Jama 291(20):2441–2447.
Powars, DR, Sandhu M, NilandWeiss J, Johnson C, Bruce S, Manning PR (1986) Pregnancy in sickle cell disease. Obstet Gynecol 67(2):217–228.
Radicchi, F, Ramasco JJ, Fortunato S (2011) Information filtering in complex weighted networks. Phys Rev E 83(4):046101.
Schütze, H, Manning CD, Raghavan P (2008) Introduction to Information Retrieval vol. 39. Cambridge University Press, UK.
Serjeant, GR, Loy LL, Crowther M, Hambleton IR, Thame M (2004) Outcome of pregnancy in homozygous sickle cell disease. Obstet Gynecol 103(6):1278–1285.
Serrano, MÁ, Boguná M, Vespignani A (2009) Extracting the multiscale backbone of complex weighted networks. Proc Natl Acad Sci 106(16):6483–6488.
Sinkhorn, R, Knopp P (1967) Concerning nonnegative matrices and doubly stochastic matrices. Pac J Math 21(2):343–348.
Slater, PB (2009a) A twostage algorithm for extracting the multiscale backbone of complex weighted networks. Proc Natl Acad Sci 106(26):66–66.
Slater, PB (2009b) Multiscale network reduction methodologies: Bistochastic and disparity filtering of human migration flows between 3,000+ us counties. arXiv preprint arXiv:0907.2393.
TabarésSeisdedos, R, Dumont N, Baudot A, Valderas JM, Climent J, Valencia A, CrespoFacorro B, Vieta E, GómezBeneyto M, Martínez S, et al. (2011) No paradox, no progress: inverse cancer comorbidity in people with other complex diseases. Lancet Oncol 12(6):604–608.
van Baal, PH, Engelfriet PM, Boshuizen HC, van de Kassteele J, Schellevis FG, Hoogenveen RT (2011) Cooccurrence of diabetes, myocardial infarction, stroke, and cancer: quantifying age patterns in the dutch population using health survey data. Popul Health Metrics 9(1):51.
West, DW, Satariano WA, Ragland DR, Hiatt RA (1996) Comorbidity and breast cancer survival: a comparison between black and white women. Ann Epidemiol 6(5):413–419.
World Health Organization (2004) International statistical classification of diseases and related health problems.
Wu, Z, Braunstein LA, Havlin S, Stanley HE (2006) Transport in weighted networks: partition into superhighways and roads. Phys Rev Lett 96(14):148702.
Xing, W, Ghorbani A (2004) Weighted pagerank algorithm In: Proceedings of the second annual conference on communication networks and services research CNSR 2004, 305–314.. IEEE, Fredericton, N.B.
Yancik, R, Havlik RJ, Wesley MN, Ries L, Long S, Rossi WK, Edwards BK (1996) Cancer and comorbidity in older patients: a descriptive profile. Ann Epidemiol 6(5):399–412.
Yıldırım, MA, Goh KI, Cusick ME, Barabási AL, Vidal M (2007) Drug—target network. Nat Biotechnol 25(10):1119–1126.
Zhou, X, Menche J, Barabási AL, Sharma A (2014) Human symptoms–disease network. Nat Commun 5:4212.
Funding
This work was funded in part by the James S. McDonnell Foundation (BF, NM, and MAR).
Availability of data and materials
In order to access the data, we were required to obtain approval from our institutional review board and from the Commission d’accès à l’information du Québec. A condition of our access was that we are not permitted to share the data with other researchers. However, researchers can request access to the same data directly from the data holder, Régie de l’assurance maladie du Québec (RAMQ), by following the process outlined on their website:
www.ramq.gouv.qc.ca/fr/donneesetstatistiques/donneessurdemande/Pages/ donneesdemande.aspx
Author information
Authors and Affiliations
Contributions
All authors contributed to every aspect of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Fotouhi, B., Momeni, N., Riolo, M. et al. Statistical methods for constructing disease comorbidity networks from longitudinal inpatient data. Appl Netw Sci 3, 46 (2018). https://doi.org/10.1007/s4110901801014
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s4110901801014
Keywords
 Weighted networks
 Null model
 Comorbidity
 Disease networks
 Centrality