
Statistical methods for constructing disease comorbidity networks from longitudinal inpatient data


Tools from network science can be utilized to study relations between diseases. Different studies focus on different types of inter-disease linkages. One of them is the comorbidity patterns derived from large-scale longitudinal data of hospital discharge records. Researchers seek to describe comorbidity relations as a network to characterize pathways of disease progressions and to predict future risks. The first step in such studies is the construction of the network itself, which subsequent analyses rest upon. There are different ways to build such a network. In this paper, we provide an overview of several existing statistical approaches in network science applicable to weighted directed networks. We discuss the differences between the null models that these methods assume and their applications. We apply these methods to the inpatient data of approximately one million people, spanning approximately 17 years, pertaining to the Montreal Census Metropolitan Area. We discuss the differences in the structure of the networks built by different methods, and different features of the comorbidity relations that they extract. We also present several example applications of these methods.


In the last decade, several network approaches have been introduced to study the interrelations between human diseases. Networks are constructed by connecting diseases that share certain features, collapsing a bipartite graph into a unipartite graph. Examples include genetic/interactomic association (Goh et al. 2007; Halu et al. 2017; Menche et al. 2015), similarity of symptoms (Zhou et al. 2014; Halu et al. 2017), similarity of pertinent drugs (Yıldırım et al. 2007), commonality of etiological environmental factors associated with diseases (Liu et al. 2009), adjacency of metabolic reactions catalyzed by corresponding mutated enzymes (Lee et al. 2008), and co-occurrence in patients (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jensen et al. 2014; Jeong et al. 2017). Sometimes more than one of these networks is juxtaposed to build a multiplex characterization (Halu et al. 2017). All of these strands of research yield insight in their respective contexts, and the increase in the breadth of topics and the diversity of approaches promises the emergence of a new field of research.

Here we focus on a methodological problem in this new field. We investigate different statistical methods for defining a weighted and directed co-morbidity network from longitudinal hospital in-patient data, and show that different methods capture different aspects of co-morbidity relations. We use a data set containing over a million people for a period of approximately 17 years, and employ different statistical methods to extract co-morbidity networks based on this data set.

Some of the previous studies have used a binary version of the comorbidity networks to study the structural properties of diseases (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jeong et al. 2017). Measures for establishing unweighted binary links between disease pairs include the ϕ-correlation (which is closely linked to the χ² statistic) and relative risk (ratio of the observed co-occurrence of a pair to the co-occurrence expected under a null model) (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jeong et al. 2017). These methods capture useful information about co-morbidities, but they also have drawbacks. The ϕ-correlation underestimates the associations in disease pairs in which one disease is rare and the other is prevalent. The relative risk tends to overestimate linkages between rare diseases and to underestimate those between prevalent diseases. To use any of these methods, one inevitably chooses trade-off parameters to construct the network with reasonable accuracy. Examples include the thresholds in Ref. Chmiel et al. (2014), the choice of relative risk cutoff (4 in Ref. Jeong et al. (2017) and 20 in Ref. Folino et al. (2010)), and the choice of labeling a reciprocal link as “lop-sided” if the weight in one direction is at least twice the weight in the other direction (Jeong et al. 2017). These thresholds are chosen to be intuitively reasonable values for the respective settings.

In this paper, we study different systematic statistical methods for building weighted directed comorbidity networks. These methods use different criteria to deem statistical significance for links. The resulting networks are sparser than the raw network, and the links are in some sense adjudicated as meaningful, that is, non-noise. In addition to statistical considerations, working with sparser networks is easier both computationally and intuitively, and the ultimate goal of gaining insight about paths of disease progression is facilitated. Here we investigate the effect of the statistical procedure used to build a network from the disease co-occurrence data on the structure of the resulting network. We show that depending on the null model used for defining the statistical significance of disease-disease links, different aspects of the comorbidity patterns are captured, and the resulting networks can have different micro/meso structures, and the centrality/ranking measures of individual diseases can differ. We describe the networks built from each method, discuss their similarities and differences, and present several example applications using these constructed networks.


Using the registry of all medically insured people in the province of Québec (fichier d’inscription des personnes assures - FIPA) we randomly sampled 25% of the people residing in the Montreal Census Metropolitan Area (CMA) in 1998. In each subsequent year, we used the FIPA to re-sample immigrants to the CMA and babies born to mothers residing in the CMA to maintain a representative, 25% sample for each year. For sampled individuals, we obtain regular data updates from the Régie de l’assurance maladie du Québec (RAMQ) on physician billing, drugs dispensed, hospitalization records, and death certificates. The data sets are linked with an anonymized unique identifier. At any given time, the dynamic cohort contains approximately 1 million people and follow-up data span approximately 17 years.

Moreover, in one of the applications that we present below, we use the dataset that is publicly available via Ref. Park et al. (2009) to connect our results to previous findings in the literature. In this data set, the protein–protein interaction (PPI) and coexpression networks and the inter-disease network of shared genes are linked to the comorbidity network derived from US Medicare claims of over 13 million elderly patients. The data set can be accessed online via

The analyses reported in this paper have been conducted using MATLAB R2015b.

Network construction methods

ICD codes

We use the ICD9 coding scheme for the classification of diseases. To make the analysis more tractable, we confine the analysis to the 3-digit classification.

Network terminology

Throughout, the pathways of disease progression are modeled by a network, where nodes represent diseases and a link from node i to node j represents an instance of diagnosis of disease i followed by a subsequent diagnosis of disease j. We call the number of connections of a node its degree, denoted by k. The weight of the link from disease i to disease j is denoted by \(w_{ij}\), which is equal to the number of times a diagnosis of disease i followed by a diagnosis of disease j is reported in the data set. By the strength (Serrano et al. 2009) of a node, denoted by s, we refer to the sum of the weights of its links. We use these quantities for either direction of the links. For example, the ‘out-strength’ \(s_{x}^{\text {out}}={\sum \nolimits }_{y} w_{xy}\) denotes the sum of the weights of the out-links of node x to other nodes, and the out-degree \(k_{x}^{\text {out}}\) denotes the number of such out-links. The out-strength of a node is equal to the total number of times that the diagnosis of that disease was followed by the diagnosis of any other disease. The out-degree of a node is the number of distinct diseases that follow that particular disease, without counting the multiplicities. Similarly we can define the in-strength and in-degree for each node. We denote the sum of the weights of all links by S, that is, we have \(S={\sum \nolimits }_{ij} w_{ij}\).
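The definitions above can be illustrated with a minimal Python sketch (the analyses in the paper were run in MATLAB, so this is not the authors' code). The function name `node_statistics`, the dictionary representation of the weighted links, and the toy counts are assumptions made for illustration only.

```python
from collections import defaultdict

def node_statistics(weights):
    """Return out/in strengths, out/in degrees and the total weight S.

    `weights` maps an ordered pair (i, j) to the link weight w_ij > 0.
    """
    s_out, s_in = defaultdict(int), defaultdict(int)
    k_out, k_in = defaultdict(int), defaultdict(int)
    for (i, j), w in weights.items():
        s_out[i] += w   # total weight leaving i
        s_in[j] += w    # total weight entering j
        k_out[i] += 1   # one more distinct successor of i
        k_in[j] += 1    # one more distinct predecessor of j
    total = sum(weights.values())  # S, the sum of all link weights
    return dict(s_out), dict(s_in), dict(k_out), dict(k_in), total

# Toy example with made-up codes and counts (not taken from the data set).
toy = {("250", "401"): 5, ("250", "428"): 2, ("401", "428"): 3}
s_out, s_in, k_out, k_in, S = node_statistics(toy)
```

In this toy network the node "250" has out-degree 2 (two distinct successors) and out-strength 7 (the multiplicities are counted).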

Raw network

In our data set, there are 1,700,000 distinct hospital visits, and the total number of unique ICD9-coded diagnoses is 6,500,000. Among all the hospital visits, 35.3% were given only one ICD9-coded diagnosis. Figure 1 presents the histogram of the number of ICD9-coded diagnoses per hospital visit. Table 1 presents the top 10 diseases in the data set with highest prevalence. Figure 2 depicts the histogram of the prevalence of the diseases in our data set. The distribution of the logarithm of the prevalences is normal-like, but the Kolmogorov-Smirnov test rejects the normality assumption (at the 0.1 level). Though not strictly log-normal, the prevalence distribution is evidently heavy-tailed, that is, most diseases have low levels of prevalence and a minority of the diseases have extremely high levels of prevalence. The starting point of our analysis is to build a raw network, which will be the substrate on which other methods construct different derived networks.

Fig. 1 Hospital visits. The distribution of the number of ICD9-coded diagnoses per hospital visit

Fig. 2 Distribution of disease prevalence. The logarithm of the prevalences is normal-like, but the Kolmogorov-Smirnov test rejects the normality hypothesis

Table 1 Top 10 most-prevalent diseases in our data set

We seek a weighted and directed characterization of the comorbidity patterns, where the weight of the link from disease i to j equals the number of instances where a patient with disease i later developed disease j. The raw network is made by sweeping over every hospital visit, and incrementing the weight of the link from disease i to disease j if in that visit disease j is diagnosed for the first time in a patient for whom disease i had been previously diagnosed. In other words, if a patient who previously had disease i but did not have disease j visits the hospital and is diagnosed with disease j, then \(w_{ij}\) increments by one. There are hospital visits where two diseases are co-diagnosed in a patient for the first time. We do not observe what the temporal order of their occurrence was before the patient visited the hospital. The link between the two diseases might be in either direction. We have two possible choices: either to discard this observation or to count this link in both directions. We choose the former, because we prefer less data to more-but-noisy data. About 35% of the patients in the data set only visited the hospital once, thus they did not constitute any comorbidity trajectory with the above criteria, and did not contribute to the raw network. The weight distribution of the links is depicted in Fig. 3. After undertaking the maximum likelihood method devised in Clauset et al. (2009), we conclude that although the distribution resembles a linear curve on the log-log scale, the hypothesis that the weight distribution is power-law is rejected. Thus throughout the paper we only use non-parametric statistical methods. We do not assume scale-freeness of the distributions.
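The sweep described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (tuples of patient identifier, visit time and diagnosed codes) and the function name `build_raw_network` are hypothetical, and visits of the same patient are assumed to be orderable by their time stamp.

```python
from collections import defaultdict

def build_raw_network(visits):
    """Build raw link weights w_ij from hospital visits.

    `visits` is an iterable of (patient_id, visit_time, codes) tuples.
    """
    history = defaultdict(set)   # patient -> codes diagnosed at earlier visits
    weights = defaultdict(int)   # (i, j) -> w_ij
    for patient, _, codes in sorted(visits, key=lambda v: (v[0], v[1])):
        seen = history[patient]
        new_codes = set(codes) - seen        # first-time diagnoses in this visit
        for j in new_codes:
            for i in seen:                   # i was diagnosed at an earlier visit
                weights[(i, j)] += 1
        # Codes first diagnosed together in this visit get no link between
        # them, because their temporal order is not observed.
        seen |= new_codes
    return dict(weights)

# Made-up visit records for two hypothetical patients.
toy_visits = [
    ("p1", 1, ["250"]),
    ("p1", 2, ["401", "428"]),
    ("p1", 3, ["250", "585"]),
    ("p2", 1, ["401"]),
]
raw = build_raw_network(toy_visits)
```

Patient "p2" has a single visit and therefore contributes no link, and the pair ("401", "428"), first co-diagnosed in the same visit, is not linked in either direction.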

Fig. 3 Distribution of link weights. According to the method devised in Ref. Clauset et al. (2009), the null hypothesis that the weights have a power-law distribution is rejected

Relative risk and observed-to-expected ratio

The relative risk (RR) is a measure of comorbidity strength used in previous studies of disease networks (Hidalgo et al. 2009; Park et al. 2009; Jeong et al. 2017). In these studies, the relative risk is equal to the ratio of the number of times that an ordered pair of diseases occur in the empirical data to the expected number of times it would occur in a random network. In Ref. Hidalgo et al. (2009), the threshold value for this quantity was obtained from formulas for confidence intervals given in Ref. Katz et al. (1978).

Here we make a notational clarification. The problem considered in Ref. Katz et al. (1978) is the problem of finding confidence intervals for risk ratios in the following sense. Consider the 2×2 contingency table given in Table 2. In the following calculations, we use the values of a,b,c,d defined in Table 2 for brevity of notation.

Table 2 The contingency table to analyze the comorbidity of diseases i and j

The relative risk considered in Ref. Katz et al. (1978), and conventionally in epidemiology and biostatistics, is defined as \({\left (\frac {a}{a+b}\right)/ \left (\frac {c}{c+d}\right)}\). But that is not how RR is defined in Refs. Hidalgo et al. (2009); Jeong et al. (2017); Park et al. (2009) to study comorbidity links. Rather, these studies define RR as \({\frac {a }{(a+b)(a+c)}}\). The numerator is the observed co-occurrence proportion, and the denominator is the expected co-occurrence proportion under independence. The quantity \({\frac {a}{(a+b)(a+c)}}\) is actually what is often called the observed-to-expected ratio. Herein we denote it by OER. So we use this terminology in our paper:

$$\begin{array}{*{20}l} { OER_{ij}= \frac{ w_{ij}\times S}{ s_{j}^{\text{in}} s_{i}^{\text{out}}}. }\end{array} $$

Note that we can equivalently write:

$$\begin{array}{*{20}l} { \log OER_{ij} = \log \frac{\frac{w_{ij}}{S}}{\left(\frac{s_{j}^{\text{in} }}{S} \right) \times\left(\frac{s_{i}^{\text{out}}}{S} \right)}. }\end{array} $$

In this form, logOER is equivalent to the point-wise mutual information that is used, for example, in natural language processing to measure how likely two words are to co-occur (Bouma 2009). The confidence intervals for OER can be obtained by applying the delta method to logOER:

$$\begin{array}{*{20}l} \text{var}(\log OER) &= \frac{1}{S} \left[ \begin{array}{c} \frac{b c-a^{2}}{a (a+b) (a+c)} \\ \frac{-1}{ a + b} \\ \frac{ -1 }{ a+c} \\ 0 \end{array} \right]^{T} \left[ \begin{array}{cccc} a(1-a) & - a b & - a c & - a d \\ - a b & b(1-b) & - b c & - b d \\ - a c & - b c & c(1-c) & - c d \\ - a d & - b d & - c d & d(1-d) \end{array} \right] \left[ \begin{array}{c} \frac{b c-a^{2}}{a (a+b) (a+c)} \\ \frac{-1}{ a + b} \\ \frac{ -1 }{ a+c} \\ 0 \end{array} \right] \\ &= \frac{ bc(1-a)+a^{2}(1-b-c)-a^{3}}{a (a+b) (a+c)S},\end{array} $$

where T denotes transpose. So the 95% confidence intervals are obtained as follows: \( \text {CI}(OER)= \left [\widehat {OER} e^{-1.96 \sqrt {\text {var}(\log OER)}}~,~ \widehat {OER} e^{+1.96 \sqrt {\text {var}(\log OER)}} \right ] \).
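As a numerical illustration, the following Python sketch evaluates OER and its 95% confidence interval for a single ordered pair. Table 2 is not reproduced here, so the cell proportions are an assumption inferred from the formulas: a is taken as \(w_{ij}/S\), a+b as \(s_{i}^{\text{out}}/S\) and a+c as \(s_{j}^{\text{in}}/S\), which makes \(a/[(a+b)(a+c)]\) coincide with the expression for \(OER_{ij}\). The function name and the input numbers are hypothetical.

```python
import math

def oer_with_ci(w, s_out, s_in, total, z=1.96):
    """Observed-to-expected ratio and its confidence interval for a pair (i, j)."""
    # Cell proportions of the 2x2 table (assumed layout, see the text above).
    a = w / total
    b = (s_out - w) / total
    c = (s_in - w) / total
    oer = a / ((a + b) * (a + c))            # equals w * S / (s_out * s_in)
    # Closed-form variance of log OER obtained from the delta method.
    var_log = (b * c * (1 - a) + a**2 * (1 - b - c) - a**3) / (
        a * (a + b) * (a + c) * total)
    half_width = z * math.sqrt(var_log)
    return oer, (oer * math.exp(-half_width), oer * math.exp(half_width))

# Made-up counts: 30 i->j transitions, s_i^out = 200, s_j^in = 300, S = 10000.
oer, (lo, hi) = oer_with_ci(w=30, s_out=200, s_in=300, total=10000)
```

With these made-up counts the point estimate is 5, that is, the transition is observed five times more often than expected under independence.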

We reiterate that what we have defined here as OER is what the previous studies of comorbidity networks have referred to as the relative risk, and what we define below as the relative risk is not to be confused with their notation. As discussed above, for the relative risk we define:

$$\begin{array}{*{20}l} { RR= \frac{\frac{w_{ij}}{s_{i}^{\text{out}}}}{\frac{s_{j}^{\text{in}}-w_{ij}}{S-s_{i}^{\text{out}}}} }\end{array} $$

So the relative risk is the ratio of the probability that j receives one of the out-links of i to the probability that j receives a link from another disease that is not i. In other words, the relative risk is the ratio of the probability that disease j succeeds disease i to the probability that disease j succeeds any other disease. For the variance of the logarithm of the relative risk, we can proceed as before and obtain the following result, which is equivalent to what is used in Ref. Hidalgo et al. (2009) for OER:

$$\begin{array}{*{20}l} {}\text{var}(\log RR) &= \frac{1}{S}{ \left[ \begin{array}{c} \frac{ b }{ a (a + b)} \\ \frac{-1}{ a + b} \\ \frac{ -d }{ c(c+d)} \\ \frac{1}{ c + d } \end{array} \right]^{T} \left[ \begin{array}{cccc} a(1-a) & - a b & - a c & - a d \\ - a b & b(1-b) & - b c & - b d \\ - a c & - b c & c(1-c) & - c d \\ - a d & - b d & - c d & d(1-d) \end{array} \right] \left[ \begin{array}{c} \frac{ b }{ a (a + b)} \\ \frac{-1}{ a + b} \\ \frac{ -d }{ c(c+d)} \\ \frac{1}{ c + d } \end{array} \right]}\\ &=\frac{ b}{ a (a+b)S} + \frac{ d}{ c(c+d)S} \end{array} $$

Despite the technical distinction between RR and OER, we can show that for practical purposes considered in this paper, these two measures are very close for almost all cases. Dividing Eq. 1 by Eq. 4, we have:

$$\begin{array}{*{20}l}{ \frac{OER}{RR}= \frac{1-w_{ij}/s_{j}^{\text{in}}}{1-s_{i}^{\text{out}}/S} \simeq 1-w_{ij}/s_{j}^{\text{in}}. }\end{array} $$

This ratio is very close to one if \(w_{ij}\) is much smaller than \(s_{j}^{\text {in}}\), that is, if disease i is not a main predecessor of disease j in comorbidity patterns. However, if the in-degree of disease j is small, so that it has few predecessors, then RR might deviate from OER. In our data set, there are only 61 disease pairs for which \(w_{ij}/s_{j}^{\text {in}}\) exceeded 10%. Because this is a negligibly small fraction of the disease pairs, the RR and OER measures are almost identical in practice. We use the network constructed based on OER in the following analyses to be consistent with the measures used in the previous literature.
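The closeness of the two measures can be checked numerically. Substituting the two definitions shows that OER divided by RR equals \(\left(1-w_{ij}/s_{j}^{\text{in}}\right)/\left(1-s_{i}^{\text{out}}/S\right)\); the Python sketch below verifies this on made-up counts. The function names and the numbers are illustrative assumptions, not values from the data set.

```python
def oer(w, s_out, s_in, total):
    """Observed-to-expected ratio of the ordered pair (i, j)."""
    return w * total / (s_in * s_out)

def relative_risk(w, s_out, s_in, total):
    """P(j follows i) divided by P(j follows a disease other than i)."""
    return (w / s_out) / ((s_in - w) / (total - s_out))

def oer_over_rr_closed_form(w, s_out, s_in, total):
    """Closed-form value of OER / RR."""
    return (1 - w / s_in) / (1 - s_out / total)

# Made-up counts: w_ij = 30, s_i^out = 200, s_j^in = 300, S = 10000.
args = (30, 200, 300, 10000)
ratio = oer(*args) / relative_risk(*args)
closed_form = oer_over_rr_closed_form(*args)
```

Here \(w_{ij}/s_{j}^{\text{in}}\) is 10%, so the two measures differ by roughly 8%; for pairs in which disease i is a minor predecessor of disease j the difference shrinks accordingly.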

A practical caveat of OER is that diseases with very low prevalence can produce unduly large values of OER, which is evident from Eq. 2. This was also pointed out previously in the disease networks literature (Park et al. 2009). A workaround is to discard disease pairs for which the expected co-occurrence under independence is smaller than a certain threshold. As investigated in Ref. Park et al. (2009), as long as the threshold exceeds unity, the structure of the OER comorbidity network remains robust against the choice of threshold. In the present paper, we choose the threshold to be equal to unity.

ϕ coefficient

The ϕ correlation coefficient is a measure of association for two binary variables (here, the binary variable indicates whether or not a certain disease is diagnosed). It quantifies the tendency of the two binary variables to co-occur, that is, the concentration of the contingency table towards the diagonal. Generalizing the undirected case considered before in the literature (Hidalgo et al. 2009), we can define the directed version of the ϕ coefficient as follows:

$$\begin{array}{*{20}l}{ \phi_{ij}= \frac{w_{ij} S- s_{i}^{\text{out}} s_{j}^{\text{in}}} {\sqrt{ s_{i}^{\text{out}} s_{j}^{\text{in}} \left(S- s_{i}^{\text{out}}\right) \left(S- s_{j}^{\text{in}}\right) }}.}\end{array} $$

If the instances of disease j succeeding disease i are more frequent than in the random case, the ϕ coefficient will be positive. If these instances are less frequent than would be expected in the random case, the ϕ coefficient will be negative, which means that having developed disease i actually decreases the risk of getting disease j. The caveat in the performance of the ϕ coefficient is that the higher the disparity between the prevalences of the two diseases, the less informative the ϕ coefficient becomes (Hidalgo et al. 2009). A technical caveat of the ϕ coefficient is that, although it is always in the [−1,+1] interval, the absolute values of its theoretical extrema are less than unity (Guilford 1965; Davenport Jr and El-Sanhurry 1991; Park et al. 2009). In the case of positive association, assuming \(s_{j}^{\text {in}}< s_{i}^{\text {out}}\), the largest possible value of \(w_{ij}\) is \(s_{j}^{\text {in}}\), and inserting it into the definition of \(\phi _{ij}\) gives the maximum value of ϕ which is theoretically attainable: \(\phi _{\text {max}}=\sqrt {\frac {\left (S-s_{i}^{\text {out}}\right) s_{j}^{\text {in}}}{\left (S-s_{j}^{\text {in}}\right)s_{i}^{\text {out}}}}\). Because both \(s_{i}^{\text {out}}\) and \(s_{j}^{\text {in}}\) are much smaller than S in our analysis, this maximum value is close to \(\sqrt {s_{j}^{\text {in}}/s_{i}^{\text {out}}}\), which is less than unity. If we had started with the assumption \(s_{j}^{\text {in}}>s_{i}^{\text {out}}\), we would have found the inverse of the latter fraction under the square root, so the maximum value would again be smaller than unity. A possible workaround would be to normalize ϕ via dividing it by \(\phi _{\text {max}}\) (Davenport Jr and El-Sanhurry 1991; Park et al. 2009; Chmiel et al. 2014).

The ϕ coefficient is related to the conventional χ² statistic in the following way: ϕ²=χ²/S. To determine statistical significance and to find a p-value for the ϕ coefficient, Ref. Hidalgo et al. (2009) employed a t-distribution approximation, but the conditions for the validity of the approximation are not discussed. Ref. Chmiel et al. (2014) uses a χ² test. It is important to note that the χ² test involves underlying assumptions regarding minimum expected cell counts in the 2×2 table (Everitt 1992). In our setting, characterized by Table 2, the expected cell count for the (1,1) cell is \(s_{i}^{\text {out}}s_{j}^{\text {in}}/S\), which is smaller than unity for many existing disease pairs in our data set. Thus the requirements for the χ² test are not always met. In Ref. Park et al. (2009), the distribution of \(w_{ij}\) is assumed to be binomial, and a Poisson approximation is used to calculate the p-values. It is of note that the Poisson approximation to the binomial distribution also requires the expected co-occurrence of the disease pairs not to be large (usually O(1) is advised in statistics textbooks). For about 10% of the disease pairs in our data set, this condition is not met. So this method is also not applicable to our data set. In this paper, we used Fisher’s exact test to decide the significance of association.
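The quantities of this section can be illustrated with the Python sketch below (Python 3.8 or later is assumed for `math.comb`). It computes the directed ϕ coefficient, the Pearson χ² statistic of the corresponding 2×2 table, and a one-sided Fisher exact p-value from the hypergeometric distribution. The paper does not state whether a one-sided or two-sided test was used, so the one-sided version is an assumption, as are the function names and the toy counts.

```python
import math

def phi_coefficient(w, s_out, s_in, total):
    """Directed phi coefficient of the ordered pair (i, j)."""
    num = w * total - s_out * s_in
    den = math.sqrt(s_out * s_in * (total - s_out) * (total - s_in))
    return num / den

def chi_square_2x2(w, s_out, s_in, total):
    """Pearson chi-square statistic (no continuity correction)."""
    observed = [w, s_out - w, s_in - w, total - s_out - s_in + w]
    rows = [s_out, s_out, total - s_out, total - s_out]
    cols = [s_in, total - s_in, s_in, total - s_in]
    return sum((o - r * c / total) ** 2 / (r * c / total)
               for o, r, c in zip(observed, rows, cols))

def fisher_one_sided(w, s_out, s_in, total):
    """P(X >= w) under the hypergeometric null with fixed margins."""
    upper = min(s_out, s_in)
    tail = sum(math.comb(s_out, x) * math.comb(total - s_out, s_in - x)
               for x in range(w, upper + 1))
    return tail / math.comb(total, s_in)

# Made-up counts: w_ij = 8, s_i^out = 20, s_j^in = 15, S = 100.
phi = phi_coefficient(8, 20, 15, 100)
chi2 = chi_square_2x2(8, 20, 15, 100)
p_value = fisher_one_sided(8, 20, 15, 100)
```

The exact test needs no minimum expected cell count, which is the reason given above for preferring it when \(s_{i}^{\text{out}}s_{j}^{\text{in}}/S\) is small.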

Disparity filter

The disparity filter (DF) takes a local, node-based approach to defining the network null model (Serrano et al. 2009). We first consider undirected networks for explanatory purposes. The DF method asks the following question: for a node with given strength s and degree k, what would we expect the weights of its links to look like if they were allocated randomly? In the null model of the DF method, each node is assumed to possess a given strength s that is distributed among its k neighbors uniformly at random. That is, if the null is true, the node has no preference among its neighbors, and it would distribute its weights uniformly at random. We can mathematically conceptualize this setting as follows: In the unit interval [0,1], k−1 points are drawn uniformly at random, thus leading to k shares of strength pertaining to k different links. Without loss of generality, consider the left-most interval (between 0 and the left-most randomly-chosen point). The probability that the length of this share (which corresponds to the normalized weight of one of the links) is greater than x is equal to the probability that all k−1 points fall in the remaining piece of length 1−x. This probability equals \((1-x)^{k-1}\). This coincides with the p-value for that link, because it gives the probability that, under the null, the share of that link exceeds the observed value.

The said procedure can be undertaken either for out-degrees or in-degrees, yielding two distinct backbone networks. We define the p-value:

$${ \left\{\begin{array}{l} \alpha_{ij}^{\text{out}}= \left(1-\frac{w_{ij}}{s_{i}^{\text{out}}}\right)^{\left(k_{i}^{\text{out}}-1\right)} \\ \\ \alpha_{ij}^{\text{in}}= \left(1-\frac{w_{ji}}{s_{i}^{\text{in}}}\right)^{\left(k_{i}^{\text{in}}-1\right)} \end{array}\right. } $$

Then, a global thresholding can be done for a desired level of significance by discarding links whose α values exceed a certain threshold value. In this paper we set α=0.05 as the threshold value. We only consider the out-network in this study, because we are concerned mainly with finding diseases that increase the risk of developing other diseases (that is, those that act as ‘roots’ in comorbidity paths).
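A minimal Python sketch of the out-link version of the disparity filter is given below. The function name, the dictionary representation of the network and the toy weights are assumptions for illustration. Note that in this sketch a node with a single out-link gets a p-value of one for that link (because the exponent is zero), so such links are never retained; how these nodes should be handled is a modeling choice that this sketch does not settle.

```python
def disparity_filter_out(weights, alpha=0.05):
    """Keep links whose out-direction p-value is below `alpha`.

    `weights` maps (i, j) to w_ij; returns {(i, j): p_value} for kept links.
    """
    s_out, k_out = {}, {}
    for (i, _), w in weights.items():
        s_out[i] = s_out.get(i, 0) + w   # out-strength of i
        k_out[i] = k_out.get(i, 0) + 1   # out-degree of i
    kept = {}
    for (i, j), w in weights.items():
        # Probability that a random share of s_i^out exceeds w_ij / s_i^out.
        p_value = (1 - w / s_out[i]) ** (k_out[i] - 1)
        if p_value < alpha:
            kept[(i, j)] = p_value
    return kept

# Made-up example: node "A" sends almost all of its weight to "B".
toy = {("A", "B"): 96, ("A", "C"): 1, ("A", "D"): 1,
       ("A", "E"): 1, ("A", "F"): 1}
backbone = disparity_filter_out(toy)
```

Only the link from "A" to "B" survives: its p-value is \(0.04^{4}\), whereas each of the weight-one links has a p-value of \(0.99^{4}\), far above 0.05.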

Iterative proportional fitting procedure

The Iterative Proportional Fitting Procedure (IPFP) is a simple method used in the context of US inter-county migration flows (Slater 2009a; 2009b). Here we utilize it to analyze disease flows. This method utilizes the Sinkhorn-Knopp algorithm (Sinkhorn and Knopp 1967), which involves iteratively normalizing the rows and columns of the adjacency matrix until the row and column sums are sufficiently close to unity. In the IPFP method, after constructing this bistochastic matrix via successive normalizations, we start from an empty network and add the links successively in the decreasing order of their weight in the bistochastic matrix until the largest connected component of the network comprises every node. One can also use a global thresholding to obtain sparser networks. As the threshold is lowered, more and more links whose values in the bistochastic matrix exceed the threshold are allowed in. In this paper, we retain the top 5% of the heaviest links of the resultant bistochastic matrix.

The mathematical procedure for the IPFP method is as follows. Suppose we seek B, a transformation of the adjacency matrix A between the N distinct diseases, and we impose the condition that every disease in B must have the same number of preceding diagnoses and the same number of succeeding diagnoses as every other disease. If we interpret disease progressions as flows between diseases, this condition is the equality of in-flow and out-flow for every disease. This fixed amount is arbitrary and can be set to unity, so the elements of B can be interpreted as probabilities. We can also normalize the elements of A by S, that is, we can interpret wij/S as the fraction of the total inter-disease flux that flows from disease i to disease j, so it can be interpreted as a probability. Consider the following minimum-cross-entropy estimation problem:

$$\begin{array}{*{20}l} \underset{\{B_{ij}\}}{\text{minimize}} \quad &\sum\limits_{i,j=1}^{N} B_{ij} \log \frac{SB_{ij}}{w_{ij}} \\ \text{subject to} \quad &\sum\limits_{j=1}^{N} B_{ij}=1, ~~~ i=1\ldots N, \\ &\sum\limits_{i=1}^{N} B_{ij}=1, ~~~ j=1 \ldots N.\end{array} $$

The task is to minimize the Kullback-Leibler divergence between the target matrix B and the adjacency matrix A (whose ij element is \(w_{ij}\)), given the normalization constraints of rows and columns of B. We denote the Lagrangian multipliers for the first constraint (normalization of rows) by \(\lambda _{i}\) and those of the second constraint (normalization of columns) by \(\mu _{j}\). The Lagrangian is:

$$\begin{array}{*{20}l}{ \mathcal{L} = \sum\limits_{ij} B_{ij} \log \frac{SB_{ij}}{w_{ij}} - \sum\limits_{i} \lambda_{i} \left(\sum\limits_{j}B_{ij}-1 \right) - \sum\limits_{j} \mu_{j} \left(\sum\limits_{i} B_{ij}-1 \right).}\end{array} $$

Setting the components of the gradient equal to zero, we get:

$$\begin{array}{*{20}l} { \log \frac{SB_{ij}}{w_{ij}} + 1 = \lambda_{i} + \mu_{j} \Longrightarrow B_{ij} =\frac{ w_{ij}}{S} e^{\lambda_{i} + \mu_{j} -1} }\end{array} $$

Let Λ be a diagonal matrix with positive elements \({\exp (\lambda _{i}-1/2)/\sqrt {S}}\) and let M be a diagonal matrix with positive elements \({\exp (\mu _{j}-1/2)/\sqrt {S}}\). Then, according to Eq. 11, the above-formulated minimum-cross-entropy problem becomes equivalent to finding a bistochastic matrix B such that B=ΛAM, with the following pair of coupled equations holding: \( \Lambda _{ii} = 1/{\sum \nolimits }_{j} A_{ij} M_{jj}\) and \(M_{jj}= 1/{\sum \nolimits }_{i} A_{ij} \Lambda _{ii}\). This is equivalent to what the Sinkhorn-Knopp theorem states: iterating the matrices Λ and M from the latter pair and inserting the limiting result into the equation B=ΛAM, the unique bistochastic matrix B is obtained (Sinkhorn and Knopp 1967). For discussions regarding convergence of the iteration, see Refs. Sinkhorn and Knopp (1967); Chakrabarty and Khanna (2018). In brief, the method converges if and only if the adjacency matrix has total support, which means that for every nonzero element of A, there exists a column permutation of A such that the nonzero element is brought to the main diagonal and every diagonal element is nonzero. Note that adding a nonzero constant to every diagonal element of the adjacency matrix (which is equivalent to adding a self-link for every disease) guarantees this property, because for every off-diagonal element we can simply swap its column such that it is brought to the main diagonal, and the main diagonal of the resulting matrix is already all-positive. Such an addition, akin to Laplace smoothing in machine learning (Schütze et al. 2008), is equivalent to viewing each disease as succeeding itself after a prior diagnosis, because any two checks for the same patient during the period of an illness would produce such a self-link. We performed a robustness check regarding the amount added to the main diagonal. We observed reasonable robustness for any added value up to \(O(10^{1})\) for the network measures that we invoked in this paper. So we used unity: we incremented the diagonal of the adjacency matrix by unity and then applied the IPFP procedure by iteratively normalizing rows and columns until sufficient convergence.
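The iterative normalization can be sketched in Python as follows. This is a plain-list illustration, not the authors' MATLAB implementation; the function name, the convergence tolerance, the iteration cap and the toy matrix are assumptions. Rows are normalized first and columns second, and the loop stops once the row sums are also within the tolerance of unity.

```python
def sinkhorn_knopp(matrix, diagonal_shift=1.0, tol=1e-9, max_iter=10000):
    """Scale a non-negative square matrix towards a bistochastic one.

    `diagonal_shift` is added to the diagonal before scaling, which
    guarantees total support and hence convergence.
    """
    n = len(matrix)
    b = [[matrix[i][j] + (diagonal_shift if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        for row in b:                                  # row normalization
            row_sum = sum(row)
            for j in range(n):
                row[j] /= row_sum
        col_sums = [sum(b[i][j] for i in range(n)) for j in range(n)]
        for i in range(n):                             # column normalization
            for j in range(n):
                b[i][j] /= col_sums[j]
        # Columns now sum to one; stop when the rows do too.
        if max(abs(sum(row) - 1.0) for row in b) < tol:
            break
    return b

# Made-up 3x3 weight matrix with an empty diagonal.
bistochastic = sinkhorn_knopp([[0, 5, 1], [2, 0, 0], [0, 3, 0]])
```

Links would then be ranked by their entries in the returned matrix, for instance to retain a chosen top fraction of the heaviest ones.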

The IPFP method has another property that facilitates the interpretation of its function. Consider the m-th stage of the iterative normalization procedure in the IPFP method. Denote the adjacency matrix at this stage by \(A^{(m)}\). Denote the sum of row i at this stage by \(r_{i}^{(m)}\) and the sum of column i by \(c_{i}^{(m)}\). Denote the adjacency matrix after row normalization by \(A^{(m+1/2)}\), and denote the result of the subsequent column-normalization by \(A^{(m+1)}\). The element ij of \(A^{(m+1/2)}\) is given by

$$\begin{array}{*{20}l} { A_{ij}^{(m+1/2)}= \frac{A_{ij}^{(m)}}{\sum\limits_{a} A_{ia}^{(m)}}. }\end{array} $$

After the subsequent column normalization, we have

$$\begin{array}{*{20}l}{ A_{ij}^{(m+1)}=\frac{A_{ij}^{(m+1/2)}}{\sum\limits_{b} A_{bj}^{(m+1/2)}}= \frac{\frac{A_{ij}^{(m)}}{\sum\limits_{a} A_{ia}^{(m)}}} {\sum\limits_{b} \frac{A_{bj}^{(m)}}{\sum\limits_{a} A_{ba}^{(m)}}} = \frac{A_{ij}^{(m)}}{ r_{i}^{(m)} \sum\limits_{b} A_{bj}^{(m)}/r_{b}^{(m)}} }\end{array} $$

Therefore, for two disease pairs ij and kℓ, we have:

$$\begin{array}{*{20}l}\frac{A_{ij}^{(m+1)} A_{k\ell}^{(m+1)}}{A_{i\ell}^{(m+1)}A_{kj}^{(m+1)}}= \frac{A_{ij}^{(m)} A_{k\ell}^{(m)}}{A_{i\ell}^{(m)}A_{kj}^{(m)}} \times \left(\frac{r_{i}^{(m)} \sum\limits_{b} A_{b\ell}^{(m)}/r_{b}^{(m)} \times r_{k}^{(m)} \sum\limits_{b} A_{bj}^{(m)}/r_{b}^{(m)}} {r_{i}^{(m)} \sum\limits_{b} A_{bj}^{(m)}/r_{b}^{(m)} \times r_{k}^{(m)} \sum\limits_{b} A_{b\ell}^{(m)}/r_{b}^{(m)}} \right) = \frac{A_{ij}^{(m)} A_{k\ell}^{(m)}}{A_{i\ell}^{(m)}A_{kj}^{(m)}}. \end{array} $$

Thus the quantity \( \left [ A_{ij}^{(m)} A_{k\ell }^{(m)}\right ]/ \left [A_{i\ell }^{(m)}A_{kj}^{(m)}\right ]\) is conserved. This can be interpreted as the odds ratio in the contingency table formed by the four diseases i,j,k,ℓ. This odds ratio is the same between the final bistochastic matrix and the original raw adjacency matrix. Thus the IPFP method focuses on relative inter-disease flows and discards the absolute link weights.
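The conservation of this cross-product ratio is easy to verify numerically. The Python sketch below applies one sweep of row normalization followed by column normalization to a made-up positive matrix and compares the ratio before and after; the function names and the matrix entries are illustrative assumptions.

```python
def normalize_once(a):
    """One IPFP sweep: row normalization followed by column normalization."""
    n = len(a)
    half = [[x / sum(row) for x in row] for row in a]           # A^(m+1/2)
    col_sums = [sum(half[i][j] for i in range(n)) for j in range(n)]
    return [[half[i][j] / col_sums[j] for j in range(n)]        # A^(m+1)
            for i in range(n)]

def cross_ratio(a, i, j, k, l):
    """Odds ratio (A_ij * A_kl) / (A_il * A_kj) for rows i, k and columns j, l."""
    return (a[i][j] * a[k][l]) / (a[i][l] * a[k][j])

# Made-up strictly positive matrix so that every ratio is defined.
a0 = [[4.0, 1.0, 2.0], [3.0, 6.0, 1.0], [2.0, 5.0, 7.0]]
a1 = normalize_once(a0)
before = cross_ratio(a0, 0, 1, 2, 2)
after = cross_ratio(a1, 0, 1, 2, 2)
```

Both values equal 0.7 up to rounding, because every row factor and every column factor appears once in the numerator and once in the denominator.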

The GloSS filter

We discussed above that the disparity filter takes a local approach, focusing on the distribution of link strengths of individual nodes among their immediate in-neighbors or out-neighbors. That is, the disparity filter assesses how likely a link strength is to be a nonrandom fluctuation in the links of an individual node. An alternative approach would be to allow the weights to be distributed globally, while still keeping the degrees fixed. This leads to the Global Statistical Significance (GloSS) filter (Radicchi et al. 2011). The GloSS filter assesses how likely a particular link weight is to be a nonrandom fluctuation in the whole network. This method works as follows. We first fix the network topology, that is, the node degrees and directions. This unweighted network is the substrate for the null model. Denote the empirically-observed weight distribution of links by \(\hat {p}(w)\). The null model is constructed by assigning to each link of the substrate network a value randomly drawn from the global empirical distribution \(\hat {p}(w)\).

We introduce the auxiliary probability distribution F(s,k), which is the probability that randomly drawing k values from the weight distribution \(\hat {p}(w)\) will yield values that sum up to s. It is straightforward to show that F(s,k) is the k-fold convolution of \(\hat {p}(w)\) with itself. Equivalently, we can take the inverse Fourier transform of the k-th power of the Fourier transform of the original distribution:

$$\begin{array}{*{20}l}{ F(s,k)= \frac{1}{2\pi} \int_{-\infty}^{\infty} \left[\int_{0}^{\infty} \hat{p}(w) e^{i w \phi} dw \right]^{k} e^{-i s \phi} d\phi }\end{array} $$
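For integer-valued weights, both routes to F(s,k) can be sketched in a few lines. This is an illustrative sketch only: the array `p_hat` (with `p_hat[w]` the probability of weight w) and the function names are assumptions, and the discrete Fourier transform stands in for the continuous transform above.

```python
import numpy as np

def sum_distribution(p_hat, k):
    """F(., k): distribution of the sum of k independent draws from p_hat."""
    F = np.array([1.0])              # k = 0: the empty sum equals 0 with certainty
    for _ in range(k):
        F = np.convolve(F, p_hat)    # one more draw added to the sum
    return F

def sum_distribution_fft(p_hat, k):
    """Same distribution (for k >= 1) via the k-th power of the transform."""
    n = k * (len(p_hat) - 1) + 1     # number of possible values of the sum
    return np.fft.irfft(np.fft.rfft(p_hat, n) ** k, n)

# Toy weight distribution on the weights 1, 2, 3 (hypothetical values).
p_hat = np.array([0.0, 0.5, 0.3, 0.2])
F3 = sum_distribution(p_hat, 3)
F3_fft = sum_distribution_fft(p_hat, 3)
print(np.allclose(F3, F3_fft))
```

Zero-padding the transform to the full support of the sum avoids wrap-around, so the circular convolution computed by the FFT coincides with the ordinary one.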

Now we can apply Bayes’ rule to obtain:

$$\begin{array}{*{20}l} P\left(w_{ij}|s_{i}^{\text{out}},k_{i}^{\text{out}},s_{j}^{\text{in}},k_{j}^{\text{in}}\right) &= \hat{p}(w_{ij}) \frac{P\left(s_{i}^{\text{out}},s_{j}^{\text{in}}|w_{ij},k_{i}^{\text{out}},k_{j}^{\text{in}}\right)}{P\left(s_{i}^{\text{out}},s_{j}^{\text{in}}|k_{i}^{\text{out}},k_{j}^{\text{in}}\right)} \\ &= \hat{p}(w_{ij}) \frac{F\left(s_{i}^{\text{out}}-w_{ij},k_{i}^{\text{out}}-1\right) F\left(s_{j}^{\text{in}}-w_{ij},k_{j}^{\text{in}}-1\right)}{P\left(s_{i}^{\text{out}},s_{j}^{\text{in}}|k_{i}^{\text{out}},k_{j}^{\text{in}}\right)} \end{array} $$

We can thus compute \(\alpha_{ij}\), which is the probability that, under the null model, the weight of the link from i to j is at least as large as the observed value \(w_{ij}\). So we arrive at the p-value for the observed weight \(w_{ij}\):

$$\begin{array}{*{20}l}{ \alpha_{ij}= \frac{\int_{w_{ij}}^{\infty} \hat{p}(w) F\left(s_{i}^{\text{out}}-w,k_{i}^{\text{out}}-1\right) F\left(s_{j}^{\text{in}}-w,k_{j}^{\text{in}}-1\right)dw} {\int_{0}^{\infty} \hat{p}(w) F\left(s_{i}^{\text{out}}-w,k_{i}^{\text{out}}-1\right) F\left(s_{j}^{\text{in}}-w,k_{j}^{\text{in}}-1\right)dw} }\end{array} $$

The method then proceeds by retaining those links whose p-values are smaller than a threshold value which sets the significance level. In this paper, we set α=0.05.
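The p-value can be sketched for integer weights by replacing the integrals with sums. The code below is a minimal illustration under that assumption, with hypothetical names (`gloss_p_value`, `p_hat`) and toy numbers rather than values from our data.

```python
import numpy as np

def sum_distribution(p_hat, k):
    """Distribution of the sum of k independent draws from p_hat."""
    F = np.array([1.0])
    for _ in range(k):
        F = np.convolve(F, p_hat)
    return F

def gloss_p_value(w_obs, s_out, k_out, s_in, k_in, p_hat):
    """Discrete analogue of alpha_ij for integer weights."""
    F_out = sum_distribution(p_hat, k_out - 1)   # the other out-links of node i
    F_in = sum_distribution(p_hat, k_in - 1)     # the other in-links of node j

    def lookup(F, s):
        return F[s] if 0 <= s < len(F) else 0.0

    terms = np.array([
        p_hat[w] * lookup(F_out, s_out - w) * lookup(F_in, s_in - w)
        for w in range(len(p_hat))
    ])
    total = terms.sum()
    if total == 0.0:
        raise ValueError("the given strengths are impossible under the null model")
    return terms[w_obs:].sum() / total           # mass at weights >= observed

# Toy example: weights 1, 2, 3; both end nodes have degree 2 and strength 4.
p_hat = np.array([0.0, 0.6, 0.3, 0.1])
alpha = gloss_p_value(w_obs=3, s_out=4, k_out=2, s_in=4, k_in=2, p_hat=p_hat)
keep_link = alpha < 0.05
print(alpha, keep_link)
```

In this toy case the observed weight is the largest possible one, yet the p-value is far above 0.05, because the fixed strengths of the two end nodes make such a weight unsurprising under the null model.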

It is of note that the p-values of the links are not necessarily highly correlated with link weights. This feature enables a trade-off between topology and weight. That is, the method has the advantage that it can capture informative links whose weights might not be outstandingly large.

Link salience

An alternative approach for assessing the importance of a link is link salience (Grady et al. 2012). The link salience method takes a more global approach than the previous methods. The underlying rationale of this method can be described intuitively with an analogy to road networks: if the network represents the roads between locations, then the link salience method tries to partition the links into superhighways and ordinary roads (Wu et al. 2006). This method involves viewing the network from the point of view of every single node, and measuring how important a given link appears from the viewpoints of all nodes. The algorithm works as follows. For a path \(\{v_{1},\ldots,v_{\ell+1}\}\) of \(\ell\) links between source node \(v_{1}\) and target node \(v_{\ell+1}\), we define the total effective distance as \({\sum \nolimits }_{i=1}^{\ell } 1/w_{v_{i}v_{i+1}}\). So between any pair of nodes, we can define a shortest path as the path that minimizes the effective distance. For each node v, we find the shortest paths to every other node. This can be done by standard methods, such as Dijkstra’s algorithm. The shortest-path tree rooted at node v can be represented by a binary matrix T(v) with the same size as the adjacency matrix of the original network. So \(T(v)_{ij}\) is 1 if the link from node i to node j lies on at least one shortest path from v to some other node, and \(T(v)_{ij}\) is zero otherwise. Finally, the salience matrix of the network is defined as \(S=(1/N) {\sum \nolimits }_{v} T(v)\), where N is the number of nodes. This matrix gives the link salience values to be used for extracting the network skeleton.
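The procedure above can be sketched as follows. This is an illustrative implementation, not the code used for our results: the function names and the toy weight matrix are assumptions, and a link is marked in T(v) whenever it is "tight", that is, whenever it lies on at least one shortest path from v.

```python
import heapq
import numpy as np

def effective_distances(W, source):
    """Dijkstra on weight matrix W, using 1/w as the length of each link."""
    n = W.shape[0]
    dist = np.full(n, np.inf)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # stale heap entry
        for x in np.nonzero(W[u])[0]:
            nd = d + 1.0 / W[u, x]
            if nd < dist[x]:
                dist[x] = nd
                heapq.heappush(heap, (nd, x))
    return dist

def link_salience(W):
    """Salience matrix S = (1/N) * sum over roots v of T(v)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    S = np.zeros((n, n))
    rows, cols = np.nonzero(W)
    for v in range(n):
        dist = effective_distances(W, v)
        for u, x in zip(rows, cols):
            # Link (u, x) is on a shortest path from v iff dist[u] + length == dist[x].
            if np.isfinite(dist[u]) and np.isclose(dist[u] + 1.0 / W[u, x], dist[x]):
                S[u, x] += 1.0
    return S / n

# Toy network: a heavy cycle 0 -> 1 -> 2 -> 3 -> 0 plus a light shortcut 0 -> 2.
W = np.zeros((4, 4))
W[0, 1] = W[1, 2] = 10.0
W[2, 3] = W[3, 0] = 5.0
W[0, 2] = 1.0
S = link_salience(W)
print(S)
```

In this toy network the light shortcut from node 0 to node 2 is never used by any shortest path, so its salience is zero, whereas each link of the heavy cycle is used from most roots.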

There are several advantages of this method. First, if the salience of a link is high (that is, close to unity), this means that it participates in most of the shortest-path trees. Thus, viewed from the point of view of the majority of the nodes, this link is important for reaching other nodes. This helps one capture pathways that are critical in reaching certain diseases. For instance, a certain link might not be particularly heavy, yet this method might pick it up because it is essential in reaching a certain disease which is otherwise isolated. Second, this method is highly robust regarding the choice of significance threshold. The distribution of link salience values calculated with the above procedure is bimodal: most links have salience concentrated near zero, a small minority have salience concentrated around unity, and the rest of the links take intermediate salience values. This bimodality considerably facilitates the choice of threshold, because the links that fall in the intermediate region between the two peaks are a negligibly small minority and any threshold value in this region will retain almost the same set of links. Third, this method enables characterizing the risks associated with links, rather than the nodes. The global nature of this approach enables extraction of ‘disease highways’, which are the main multi-disease progression paths. In this paper, we used the threshold value of 0.9 for the salience values. That is, we retain all the links with salience greater than 0.9.

Comparing networks

In this section we first provide a broad overview of the structure of the constructed networks. We then introduce different network measures for characterizing disease importance and apply these measures to the constructed networks. The measures that we use are in-strength, out-strength, eigenvector centrality, PageRank, Hubs and Authority (the HITS algorithm (Kleinberg 1999)), and betweenness centrality.

Overview of the function of different networks

Table 3 presents the summary statistics of the networks built with the above methods. For each network, we calculated the number of links, the percentage of links from the raw network that the method retains, and the number of nodes. To describe the connectivity of nodes, we calculate the mean and standard deviation of the out-strengths and out-degrees; the means are mathematically identical to those of the in-strengths and in-degrees, respectively. To characterize the relation between the degrees of nodes and their neighbors, we calculate the nearest-neighbor degree statistics as follows. For each node, we find the average out-degree of its out-neighbors. Then we calculate the mean and standard deviation of these values across all nodes. This measure quantifies the nearest-neighbor degree correlations of the network. For each network, we denote the average prevalence of the diseases kept by the method by 〈P〉. For each disease, we can find the average prevalence of its neighboring diseases. The average over all these average values is denoted by \(P_{nn}^{\text {out}}\) if we use the out-neighbors for calculations, and \(P_{nn}^{\text {in}}\) if we use the in-neighbors. Finally, we report a measure of homophily, calculated as the correlation between the prevalence of diseases and the average prevalence of the neighbors of diseases. This assortativity coefficient can be calculated using either the out-neighbors or the in-neighbors. The resulting coefficients are denoted by \(r_{P}^{\text {out}}\) and \(r_{P}^{\text {in}}\), respectively. These measures can also be visually inspected in Figs. 5 and 6. Topologically (disregarding the link weights), every constructed network is a subgraph of the raw network. But there is no particular relation between the constructed networks, that is, one cannot be derived from another, because the methods employ different filtering rationales. 
The OER, ϕ, and IPFP methods assign new weights to the inter-disease links. These differ from the raw link weights, which simply denote the number of diagnosis successions. In contrast, in the DF, GloSS, and Salience methods, filtering techniques are applied only to decide which links of the raw network to retain and which to discard, and the weights of the retained links remain intact.
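The neighbor-prevalence statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions: neighbors are averaged without link weights, the correlation is the Pearson coefficient, nodes without neighbors are skipped, and the function name and toy data are hypothetical.

```python
import numpy as np

def neighbor_prevalence_stats(A, prevalence, direction="out"):
    """Return (mean neighbor prevalence, prevalence assortativity r_P)."""
    links = np.asarray(A) != 0          # topology only; weights are ignored
    if direction == "in":
        links = links.T                 # row i now lists the in-neighbors of i
    P = np.asarray(prevalence, dtype=float)
    degree = links.sum(axis=1)
    has_nbrs = degree > 0               # nodes without neighbors are skipped
    P_nn = (links[has_nbrs] @ P) / degree[has_nbrs]
    r = np.corrcoef(P[has_nbrs], P_nn)[0, 1]
    return P_nn.mean(), r

# Toy core-periphery network: one prevalent disease linked both ways to three rare ones.
A = np.zeros((4, 4))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
prevalence = [1000, 10, 20, 30]

mean_out, r_out = neighbor_prevalence_stats(A, prevalence, "out")
mean_in, r_in = neighbor_prevalence_stats(A, prevalence, "in")
print(mean_out, r_out, mean_in, r_in)
```

For this star-like toy network the coefficient is close to -1 in both directions, which is the signature of the mutual core-periphery structure discussed in the text.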

Table 3 Summary statistics of the constructed networks

From Table 3 we can obtain quick insight into the differences between the function of different methods. For the raw network, the correlation between the prevalence of diseases and that of their in-neighbors, and the corresponding correlation for their out-neighbors, denoted by \(r_{P}^{\text {in}}\) and \(r_{P}^{\text {out}}\) respectively, are both negative. This is also evident from the negative slope of the curves in Figs. 5 and 6 corresponding to the raw network. Also, the prevalence distribution of the raw network, presented in Fig. 4, shows that the distribution of the prevalences has a lognormal-like shape, so most diseases have small prevalence, and high-prevalence diseases are relatively less common. Combining these two observations, we deduce that the connections are strongly disassortative and mutual. That is, the network has a core-periphery structure in which a few highly-prevalent nodes preferentially connect to many low-prevalence nodes, and vice versa.

Fig. 4

Visualizing Prevalences. The distribution of the ln(prevalence) of the diseases in different constructed networks sheds light on their selection criteria. The GloSS method retains disproportionately high-prevalence diseases, the IPFP method discards disproportionately high-prevalence diseases, and the OER, DF, and Salience methods discard disproportionately low-prevalence diseases.

Fig. 5

Comparing prevalences of neighboring nodes. The horizontal axis pertains to the prevalence of the diseases, and the vertical axis depicts the distribution of the prevalences of their out-neighbors. The thick lines denote median values and the shades demarcate 0.25 and 0.75 quantiles. The axes are natural-logarithmic

Fig. 6

Comparing prevalences of neighboring nodes. The horizontal axis pertains to the prevalence of the diseases, and the vertical axis depicts the distribution of the prevalences of their in-neighbors. The thick lines denote median values and the shades demarcate 0.25 and 0.75 quantiles. The axes are natural-logarithmic

The OER and ϕ methods retain comparable portions of the links, but the OER network discards about 10% of the nodes. We manually verified that all the diseases that the OER method discards are in fact extremely low-prevalence. This is also visible in Fig. 4, where the left tail of the OER curve begins around 12, falling in the 6th percentile of the prevalence distribution of the raw network. So we observe that the OER network discards all the diseases in the bottom 5 percent of prevalence. From Figs. 5 and 6, we also observe that the low-prevalence diseases are discarded by the OER method, and that the curves corresponding to the OER method have negative slopes, similar to the raw network. This indicates that the OER method retains the structural core-periphery property discussed above. For the ϕ network, the correlations between the prevalence of diseases and that of their in-neighbors and out-neighbors are close to zero. This means that the disease pairs in the ϕ network exhibit neither homophily nor heterophily in prevalence. This is visible in Figs. 5 and 6, where the curves pertaining to the ϕ method have overall slope close to zero.

The DF network is considerably sparser than the OER network, though it retains roughly the same number of nodes. In contrast to the OER and ϕ networks, the DF network exhibits a high imbalance between \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \). This means that on average, the average prevalence of out-neighbors of nodes is five times greater than the average prevalence of the in-neighbors of nodes. This is expected by construction, because the DF network was built by retaining the disproportionately large out-links of each node and discarding the out-links with smaller weights. So for a typical disease, heavier out-links are systematically selected, and these are the links that typically point to high-prevalence diseases. In other words, high-prevalence diseases are the ones which appear with high frequency among either the in-neighbors or the out-neighbors of a typical disease, and the DF method systematically discards light out-links without doing the same to the in-links, thereby creating an imbalance. Consequently, high-prevalence nodes become more likely than before to appear among the out-neighbors. This asymmetry is also evident from the mismatch between the signs of \(r_{P}^{\text {in}}\) and \(r_{P}^{\text {out}}\). Moreover, as is visible in Fig. 4, the DF method discards diseases with low prevalence almost entirely, which is a feature it shares with the OER method.

The IPFP network is sparser than the OER, ϕ, and DF networks. The average prevalence of the diseases that IPFP retains is lower than in the other three methods. The same is true for the median of the prevalences. This means that the IPFP filter discards diseases with extremely high prevalence. This is also evident from the tail behavior of the distribution of ln(prevalence) in Fig. 4, where the tail of the IPFP curve visibly plummets below the other curves. The \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \) values of the IPFP network are an order of magnitude smaller than the corresponding values in the OER, ϕ, and DF networks. Thus the core-periphery feature is nonexistent in the IPFP network, and the prevalences of the disease pairs are on average less unequal than in the other networks. Moreover, the IPFP method rewards intermediacy. As Fig. 4 shows, the intermediate-prevalence diseases constitute a higher fraction of the IPFP network than of the raw network.

The GloSS filter has a unique property: the average prevalence of the diseases it retains is twice as high as in the raw network. So the GloSS method retains disproportionately high-prevalence diseases. This is also evident from Fig. 4, where the prevalence curve pertaining to the GloSS method is shifted to the far right. In fact, the lowest prevalence in the GloSS network is 1176, which is at the 47th percentile in the raw network. So the GloSS network discards about half of the diseases, and it suffers from this problem much more severely than the OER and DF methods. The values of \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \) are also much greater than the average and median values of P. Moreover, \(r_{P}^{\text {in}}\) and \(r_{P}^{\text {out}}\) are strongly negative. These indicate a strong mutual core-periphery structure. This feature was also present in the raw network as discussed above, but is markedly accentuated by the GloSS method. The GloSS method discards low-prevalence diseases, and retains medium and high-prevalence diseases. As Figs. 5 and 6 illustrate, in the remaining network, the out-links of high-prevalence diseases are preferentially towards medium-prevalence diseases, and conversely, the out-links of medium-prevalence diseases are preferentially towards high-prevalence diseases.

The Salience network is the sparsest, retaining less than 2.5% of all the links in the raw data. The Salience network is small by construction because it seeks to retain a small fraction of links that would capture the macro skeleton of the network. Since according to Table 3 the number of links is smaller than the number of nodes in this network, we deduce that the Salience network is disconnected. Via visualization we verified that the Salience network is in fact segregated into disjoint connected components. The Salience method, like the OER, DF, and GloSS methods, is biased towards high-prevalence diseases. However, the degree to which the Salience method suffers from this bias is comparable to the OER and DF methods, and is not as severe as in the GloSS method. Another notable feature unique to the Salience method is the great difference between \(\langle P_{nn}^{\text {out}} \rangle \) and \(\langle P_{nn}^{\text {in}} \rangle \), with the former being about 60 times greater than the latter. This was the case for the DF network too, but there the ratio was about five, which is a much smaller difference than in the Salience network. To investigate whether the high skew of the prevalence distribution is disproportionately affecting the averages, we repeated the calculations using the median instead of the average, and we observed the same pattern for both the DF and Salience networks. This indicates a highly-unequal three-level structure in the Salience network: the Salience network comprises many locally-core-periphery substructures, where the peripheral nodes preferentially connect to the core nodes, but the core nodes do not reciprocate. Some of the core nodes connect to other core nodes, and some of them do not. 
That is, there are three types of nodes in the Salience network: (i) core nodes with high in-strength (received through many unilateral links from small peripheral nodes), (ii) nodes with high in-strength and intermediate out-strength (the in-flow comes from many peripheral nodes, the out-flow goes into the core nodes from the first category), and (iii) the small peripheral nodes that do not receive any in-flow and form unilateral links towards the former two categories. This topology can be simply conceptualized as follows: consider a star graph with many leaves, all unilaterally linking to the central node. Now suppose we take a fraction of the leaves, and for each of them introduce some new leaf nodes that unilaterally connect to them, turning them into mini-authorities. This structure exhibits the correlation properties observed in Table 3 for the Salience network. We also visually verified this hypothesis. Note that the lack of reciprocation from high-prevalence nodes is in contrast to the raw and OER networks, which, although highly unequal, comprised mostly-bidirectional links.

Applicability of different methods

To summarize, the OER and ϕ networks focus on disease-disease relationships in the usual statistical way, that is, in the absence of network structure. If the task of a study is to investigate the comorbidity between a certain pair of diseases, then these methods are suitable. The OER network disproportionately discards diseases with low prevalences, and the ϕ network disproportionately discards disease pairs with highly-unequal prevalences. The DF network is better if the questions of comorbidity are being formulated conditional on having developed certain diseases first, and if one wants to compare the risks of different diseases that succeed the given initial diseases. The DF method has the disadvantage that it discards diseases with low prevalence. The IPFP method has an egalitarian approach which controls for disease prevalence. This method prevents the results from being dominated by high-prevalence diseases. It asks: if all the diseases had the same in-flow and the same out-flow, what would be the best estimate of the comorbidity matrix, given the information in the empirical matrix? In other words, the IPFP method investigates comorbidity patterns controlled for individual disease prevalences. The GloSS method assesses link weight fluctuations on a global level. So this method is preferred when the task of the study is to compare comorbidity links globally, not focusing on a particular disease, that is, when the emphasis of the study is on links rather than nodes. The Salience method is only relevant for disease trajectories globally and is not suitable for studying comorbidity statistics. The Salience method is suitable if, for instance, one would like to investigate the expected distance between certain diseases, that is, how many intermediate diseases it would typically take to develop disease B having developed disease A. The OER, ϕ, and DF methods have a local approach. The GloSS filter and the Salience method have global approaches. 
The IPFP method has a meso-scale approach.

Example applications

In what follows we calculate several conventional measures of node importance. We investigate the agreement between different networks on the importance profile of diseases. We denote the diseases by 3-digit ICD9 codes.

Different measures for node importance

Node Strength. For directed weighted networks, we can use the out-strength and in-strength to characterize nodal connectivity, as discussed above. Table 4 presents the top 5 diseases with highest in-strength, and Table 5 presents the top 5 diseases with highest out-strength.

Table 4 Top 5 diseases with highest in-strength in different networks
Table 5 Top 5 diseases with highest out-strength in different networks

Eigenvector centrality. A basic measure to characterize the centrality of nodes in a network is the eigenvector centrality (Bonacich 1987). The basic intuition behind this measure is that important nodes are those that are connected to other important nodes. This leads to a self-consistent linear set of equations whose solution gives the centrality scores of nodes. Table 6 presents the top 5 diseases with highest eigenvector centrality for different networks.

Table 6 Top 5 diseases with highest eigenvector centrality in different networks

The rankings of top nodes show the differences between the networks constructed by the different methods. It is of note that eigenvector centrality tends to capture nodes with high in-strength. The OER and IPFP methods predominantly pick up pregnancy-related diagnoses, whose comorbidity links (preceding or following other diseases or conditions) to several different categories of diseases are well-researched (Desai et al. 2007; James et al. 2005; Brabin et al. 2001; Kittner et al. 1996). The major difference between pregnancy-related ICD codes (630-679) and other categories is cohesion. As shall be discussed below in Tables 19 and 20, the aggregated in-strength and out-strength of this category of nodes is not high as compared to other categories (it ranks among the bottom 5 in both cases), but interestingly, in terms of the weights of within-category links, this category has an outstandingly large share (it ranks second, after the diseases of the circulatory system). This means that the pregnancy-related nodes form a cohesive subnetwork. This dense clique-like structure is composed of diseases with intermediate-level prevalence, with prevalence values all relatively close to one another. The OER coefficient is high for these disease pairs because in addition to high co-occurrence, all diseases have intermediate levels of prevalence, so the overestimation and underestimation tendencies of the OER method are not encountered. Another family of diseases that frequently have high values of eigenvector centrality in all network construction methods is the family of 28X diseases, which are the diseases of the blood and blood-forming organs. Notable diseases in this family include different types of anemia, Haemophilia, and diseases of white blood cells. Iron deficiency anemias (280) and ‘Other and unspecified anemias’ (285) appear consistently higher than Haemophilia. 
This is mainly caused by the high prevalence of 280 and 285 anemias (with prevalences 40,000 and 107,000, respectively), as compared to Haemophilia, whose prevalence is about 10,000 in our data set.

Table 7 presents the mutual correlation coefficients for the eigenvector centrality of nodes computed for different networks. The IPFP method seems uncorrelated or negatively correlated with the other methods. Most of the remaining methods are strongly correlated with one another in terms of eigenvector centrality. The Salience method focuses on trajectories and distorts the degrees, so the eigenvector centrality of this method is weakly correlated with that of the other methods. The IPFP method changes the strengths and assigns new link weights after controlling for disease prevalences. The IPFP method has a strong negative association with the GloSS method, because by construction, the GloSS method rewards high-prevalence diseases and the IPFP method does the converse.

Table 7 The correlation matrix for the eigenvector centrality of nodes between different networks

PageRank. The second centrality measure that we use is PageRank (Brin and Page 1998). The PageRank algorithm was originally used to characterize the importance of websites. This algorithm basically quantifies the likelihood that a person clicking randomly on links will arrive at a given website (Xing and Ghorbani 2004). This method simulates random walks on the network, with a damping factor that sets the probability that the walk follows a link at any step; otherwise the walk terminates and restarts at a node chosen uniformly at random. We set the damping factor equal to 0.85, which is conventional in the literature. Table 8 presents the results for the top 5 diseases with highest PageRank for different networks. An intuitive approximation is that PageRank tends to focus on nodes with high in-strength (Fortunato et al. 2006). This is confirmed by comparing Table 8 with Table 4; many of the top nodes are common between these two tables. Comparing Table 8 with Table 6, we observe that IPFP returns different results, mostly conditions pertaining to the perinatal period. We point out that most of these diseases are not particularly highly-connected nodes in the raw network. Motivated by this observation, we can gain intuition about how IPFP works by noting that, when normalization of rows or columns is performed at each stage, nodes with large degrees can lose their weight if their neighbors are also of large degree. For example, consider the row-normalization step, where rows (and similarly columns) are normalized to add up to unity. At this step, a row that corresponds to a node with out-degree 100 that is connected with heavy and equal out-links to its 100 out-neighbors, and a row that corresponds to a node with out-degree 3 that is connected with light and equal out-links to its 3 out-neighbors, both become uniform rows that sum to unity, so the distinction between heavy and light links is erased. In other words, this method is capturing something that is inherently of a different nature than every other method considered here. 
Table 9 presents the correlation values for PageRank scores in different networks. Similar to the case of eigenvector centrality, IPFP is negatively correlated with every other method. This observation highlights the difference between IPFP and other methods, and invites more investigation into what this method extracts.
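The random-walk description of PageRank can be sketched with a short power iteration. This is an illustrative sketch, not the implementation behind Tables 8 and 9: the function name is hypothetical, transition probabilities are taken proportional to link weights, and nodes without out-links are assumed to jump uniformly at random.

```python
import numpy as np

def pagerank(W, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a weighted directed network."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    out_strength = W.sum(axis=1)
    safe = np.where(out_strength > 0, out_strength, 1.0)
    P = W / safe[:, None]                 # row-stochastic transition matrix
    P[out_strength == 0] = 1.0 / n        # dangling nodes jump uniformly (assumption)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Follow a link with probability `damping`, otherwise restart uniformly.
        r_next = damping * (r @ P) + (1.0 - damping) / n
        converged = np.abs(r_next - r).sum() < tol
        r = r_next
        if converged:
            break
    return r

# Toy network: three diseases lead to disease 0, which leads to disease 1.
W = np.zeros((4, 4))
W[1, 0] = W[2, 0] = W[3, 0] = 5.0
W[0, 1] = 2.0
scores = pagerank(W)
print(scores)
```

In this toy network the node with the largest in-strength receives the highest score, in line with the approximation by in-strength mentioned above.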

Table 8 Top 5 diseases with highest PageRank in different networks
Table 9 The correlation matrix for the PageRank of nodes between different networks

Table 9 is also informative regarding the function of different methods. The PageRank scores of the raw network have a strong positive correlation with those in the ϕ, DF, and GloSS networks. The IPFP method shows a negative correlation, similar to the case of eigenvector centrality, because it punishes diseases with high prevalence and assigns a low strength to them, as discussed before. The Salience method constructs a network consisting entirely of non-mutual links and distorts the in-strength and out-strength patterns. The in-strengths of most nodes are mapped to zero, because, as we discussed above, the Salience network consists mostly of highly-unequal substructures with peripheral nodes unidirectionally connected to core nodes. As mentioned above, the PageRank scores of nodes can be approximated with their in-strength (Fortunato et al. 2006). Since the Salience method maps the in-degrees of many nodes to zero (retaining only their out-links to highly-prevalent core nodes, as discussed above), the concomitant distortion in in-strengths results in a lower correlation between the PageRank of the Salience method and other networks.

Hubs and authority. An alternative way to characterize the importance of nodes in terms of in-flow and out-flow is to employ the HITS algorithm (Kleinberg 1999). The HITS algorithm is a simple and intuitive method that was originally devised to characterize the rankings of websites. This algorithm focuses on simultaneously finding hubs and authorities on the web. In that context, a hub is a website that is influential in directing users towards other highly-ranked websites, and an authority is a website that is pointed to by highly-ranked websites. In the context of diseases, a hub would be a disease which, if developed, increases the risks of developing other diseases. An authority would be a disease that follows many other diseases. Table 10 presents the results for the top 5 diseases in terms of hubness in different networks. Comparing the results in Table 10 with those of Table 6, we observe that the hubness scores for all networks correlate highly with those of eigenvector centrality, except DF and Salience. Table 11 presents the correlation matrix for hubness. The hubness score is important in comorbidity studies because the hubs that the HITS algorithm nominates are universal senders, and in the context of comorbidity studies, these would pertain to diseases that substantially increase the risk of many other diseases, demanding more prevention and care. Table 11 shows that there is good agreement between the hub scores of different methods, so despite their structural differences, the hub score is robust and can be reliably used. The two usual exceptions are present here as well: the IPFP method, and the Salience method. These are expected because their tasks are different: the hubness scores of the IPFP method pertain to an alternative inflow-outflow comorbidity matrix where the prevalences are controlled for, and the Salience method only focuses on distances and trajectories rather than actual disease-disease relations. 
Table 12 presents the top 5 nodes for authority, and Table 13 presents the correlation matrix for authority scores. For the authority index too, there is good agreement between every method except IPFP and Salience.
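The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. The code below is an illustrative sketch with a hypothetical function name and a toy network; it assumes that W[i, j] is the weight of the link from disease i to disease j and that the network has at least one link.

```python
import numpy as np

def hits(W, n_iter=200):
    """Iterate the HITS updates and return (hub scores, authority scores)."""
    W = np.asarray(W, dtype=float)
    hubs = np.ones(W.shape[0])
    for _ in range(n_iter):
        auth = W.T @ hubs                 # authorities are pointed to by good hubs
        auth /= np.linalg.norm(auth)
        hubs = W @ auth                   # hubs point to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auth

# Toy star: three peripheral diseases link unilaterally to disease 0.
W = np.zeros((4, 4))
W[1, 0] = W[2, 0] = W[3, 0] = 1.0
hubs, auth = hits(W)
print(hubs, auth)
```

In this toy star the central node is the only authority and has zero hub score, which mirrors the unilateral core-periphery substructures described for the Salience network.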

Table 10 Top 5 diseases with highest hub-ness scores in different networks, according to the HITS algorithm
Table 11 The correlation matrix for the hubness score of nodes between different networks, according to the HITS algorithm
Table 12 Top 5 diseases with highest authority scores in different networks, according to the HITS algorithm
Table 13 The correlation matrix for the authority score of nodes between different networks, according to the HITS algorithm

Betweenness centrality. Distance-based network measures capture different aspects of how essential a node is in the reachability between other pairs of diseases. Here we use betweenness centrality, which characterizes the number of shortest paths between different disease pairs that pass through each given disease. There might be a disease that separates a dense module of diseases from the whole network, such that one would develop the diseases within the module only when one first develops this gate-keeper disease. Or, conversely, after developing a disease within the module, subsequent diseases outside the module occur typically after this gate-keeper disease is developed. Such a disease would have a high betweenness centrality. Another type of node that is typically characterized by high betweenness centrality is the core node in a strong core-periphery structure, because to go from one peripheral node to another, one has to pass through the core nodes. So this measure is helpful in detecting this structural feature of diseases. Table 14 presents the top 5 nodes with highest betweenness centrality in the constructed networks. In the raw, OER, ϕ, DF, and GloSS networks, the top nodes are those with extremely high prevalence. The above-discussed core-ness underpins the high betweenness centrality of these nodes. The Salience method assigns high betweenness centrality scores to conditions pertaining to the perinatal period, which is potentially related to the first typical case of high betweenness mentioned above. That is, certain prenatal conditions act as gatekeeping conditions between conditions before birth and conditions after birth. The results of the IPFP method are less intuitive, consistent with the results for the previous measures. Table 15 presents the correlations across different networks.

Table 14 Top 10 diseases with highest betweenness centrality
Table 15 Correlation between the betweenness centrality of nodes across constructed networks

Example application: the role of disease prevalence

To gain better intuition about the network measures that we used to characterize diseases, we investigate the correlation between each network measure and disease prevalence in every constructed network. The results are presented in Table 16. In the raw network, every measure has a strong positive association with disease prevalence. Thus, highly prevalent diseases such as diabetes and hypertension receive a high score no matter which network measure is used to characterize the diseases. The same is true for the DF and GloSS networks. In contrast, for the IPFP network, prevalence is negatively correlated with every network measure except hub and authority, and for the latter two the association is close to zero. For the Salience network, the correlation between the hub index and prevalence is negative. Together with the positive correlation between the authority index and prevalence, this indicates that in the Salience network, high-prevalence nodes disproportionately receive in-flows and do not reciprocate. This is in contrast with every other method, where both of these correlations are positive, meaning that high-prevalence nodes are characterized by both large in-flows and large out-flows.

Table 16 The correlation between the network measures of diseases and disease prevalence, for different constructed networks

Note that the exact value of prevalence for each disease cannot be recovered from the networks alone. This is due to the existence of disease progression trajectories that contain more than two diseases. If every patient had a record of the form A→B, that is, only two diseases, then the prevalence of each disease would simply be the sum of the out-strength and in-strength of its corresponding node in the raw network. But because many higher-order records such as A→B→C, and longer ones, exist, the prevalence information is lost. If we conceptualize the weighted links in the raw network as distinct links with unit weight, then each link would represent exactly one patient if every disease trajectory consisted of a single link (that is, had the form A→B). But due to the presence of higher-order trajectories, more than one link can pertain to a single patient, and thus the prevalence information is lost. However, Table 16 shows a strong correlation (about 0.95) between the out-degree of diseases and their prevalence. So if we did not have the prevalence data, we could use out-degree as a proxy for prevalence. It would be interesting to investigate whether this correlation pattern between various network measures and prevalence is replicated in data from other regions of the globe.
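The following sketch works through this argument on toy data. It assumes, for illustration only, that the raw network draws one unit-weight link between each pair of consecutive diagnoses in a patient's trajectory; the trajectories and function names are hypothetical.

```python
from collections import Counter


def strength_sum(trajectories):
    """In-strength plus out-strength of each disease in the raw network,
    assuming one unit-weight link per pair of consecutive diagnoses."""
    out_s, in_s = Counter(), Counter()
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            out_s[a] += 1
            in_s[b] += 1
    return {d: out_s[d] + in_s[d] for d in set(out_s) | set(in_s)}


def prevalence(trajectories):
    """Number of patients in whom each disease appears at least once."""
    prev = Counter()
    for traj in trajectories:
        prev.update(set(traj))
    return dict(prev)


# Every patient has exactly two diseases: strengths recover prevalence.
pairs_only = [["A", "B"], ["A", "B"], ["C", "B"]]
print(strength_sum(pairs_only) == prevalence(pairs_only))   # True

# One patient has the chain A->B->C: B is counted once as a target and
# once as a source, so its strength sum exceeds its prevalence.
with_chain = [["A", "B", "C"], ["A", "B"]]
print(strength_sum(with_chain)["B"], prevalence(with_chain)["B"])  # 3 2
```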

Example application: shared genes and protein-protein interactions

We can use the results to investigate the relation to previous studies on disease networks that, in addition to comorbidity observations, incorporate other disease-disease linkages into the analysis. For example, in Ref. Park et al. (2009), the protein–protein interaction (PPI) and coexpression networks and the inter-disease network of shared genes are compared to the comorbidity records from US Medicare claims. Many observed comorbidity patterns are thereby linked to the shared PPIs and the shared genes of the diseases. For example, the significant comorbidity between Alzheimer’s disease and myocardial infarction is linked to their shared ACE and APOE genes. As another example, a PPI between the genes associated with autonomic nervous system disorder and carpal tunnel syndrome (IKBKAP and TTR, respectively) is suggested to contribute to the statistically significant comorbidity between these two diseases. We can investigate how the results of Ref. Park et al. (2009) are reflected in our constructed networks. Because of the link-focused nature of these results, we expect the local methods to be relevant here, which include the OER measure, the ϕ coefficient, and the DF method. In our data set, there are 747 cases where Alzheimer’s disease is diagnosed following myocardial infarction. All three methods deem this link significant. There are 506 cases where, conversely, Alzheimer’s disease precedes. Only the DF method deems this direction of the link significant, which indicates that myocardial infarction attracts a significant share of the out-strength of Alzheimer’s disease. This implies that conditional on having developed Alzheimer’s disease, there is an elevated chance of later developing myocardial infarction. There are 38 cases where autonomic nervous system disorder is diagnosed prior to carpal tunnel syndrome, and OER and ϕ deem this link significant. There are 47 cases where carpal tunnel syndrome precedes, and neither of the methods deems this direction of the link significant. This highlights the strength of the directed characterization of the network over the undirected versions considered in the literature, because in addition to the association between disease pairs, the distinction between the statistical properties of the two directions sheds light on which of the two diseases is more likely to cause the other, or at least to precede the other in the causal network that subsumes them both along with other covariates.

Table 17 pertains to disease pairs that have OER>1.5 and are related via shared PPIs or genes according to Ref. Park et al. (2009). The table presents the percentage of such disease pairs that are deemed significant in the different constructed networks. As expected, OER and ϕ have the best performance, and DF also performs well. These three methods have a local focus and are therefore well suited to detecting such link-based relations. The other methods, however, focus more on the global structure of the network and, as Table 17 shows, perform poorly at detecting such disease pairs, which matches the expectation.

Table 17 Percentage of disease pairs retained by different constructed networks whose gene or PPI commonality are deemed significant by Ref. Park et al. (2009)

Example application: negative comorbidity and protective effects

As mentioned above, the ϕ coefficient and the OER measure can capture negative comorbidities, that is, cases where developing disease A is negatively associated with developing disease B. We can use the ϕ correlation to detect strong negative comorbidities, bearing in mind that distinguishing actual protective effects from mere negative associations requires more rigorous causal analysis and is not within the scope of this paper. Throughout this paper, we only investigate associations. Here we consider several existing examples in the literature where such a negative association has been suggested, and check whether our data replicate these findings. A well-known example is the negative association between Alzheimer’s disease and various types of cancer (see Ref. Ma et al. (2014); Tabarés-Seisdedos et al. (2011) and references therein). We focus on the ICD9 codes 140 to 239, which pertain to the neoplasms category, and investigate how the ϕ and OER methods perform at capturing negative comorbidities. For the OER method, we look for significant links with OER<1, and for the ϕ method, we look for significant links with ϕ<0. The ϕ method gives 29 distinct neoplasms exhibiting significant negative comorbidity with Alzheimer’s disease. The OER method detects 26 of those 29 links, and finds no links that the ϕ method had not. From these two sets, 27 neoplasms are deemed as significant by both methods. In the converse direction, where the initial diagnosis of a neoplasm is associated with reduced risk of subsequent Alzheimer’s disease, the ϕ method detects 30 neoplasms, and the OER method detects a subset of them comprising 28 neoplasms. For the Alzheimer’s-neoplasm direction of negative comorbidities, the strongest negative association that the ϕ method returns pertains to secondary/unspecified malignant neoplasm of lymph nodes, and for the OER method, the strongest negative association pertains to benign neoplasm of kidney. In the neoplasm-Alzheimer’s direction, the first rank for the ϕ method is secondary malignant neoplasm of respiratory and digestive systems, and for the OER method the first rank belongs to malignant neoplasm of gallbladder and extra-hepatic bile ducts.
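To make the two sign criteria concrete, the sketch below evaluates both measures on a single 2×2 co-occurrence table. It uses the textbook undirected forms (observed count over the count expected under independence, and the Pearson ϕ coefficient of a 2×2 table); the directed variants and significance tests used in this paper are defined in the earlier sections and may differ. The function names and the counts are hypothetical.

```python
import math


def oer(n_ab, n_a, n_b, n_total):
    """Observed-to-expected ratio of co-occurrence under independence."""
    expected = n_a * n_b / n_total
    return n_ab / expected


def phi(n_ab, n_a, n_b, n_total):
    """Pearson phi coefficient of the 2x2 table for diseases A and B."""
    numerator = n_ab * n_total - n_a * n_b
    denominator = math.sqrt(n_a * n_b * (n_total - n_a) * (n_total - n_b))
    return numerator / denominator


# Hypothetical counts: 1000 patients, 100 with A, 200 with B, 10 with both.
# Independence would predict 20 co-occurrences, so the pair is negatively
# associated: OER falls below 1 and phi is negative.
print(oer(10, 100, 200, 1000))            # 0.5
print(round(phi(10, 100, 200, 1000), 4))  # -0.0833
```

In these textbook forms, OER<1 holds exactly when ϕ<0, because both conditions reduce to the observed count falling below the expected one.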

Example application: pregnancy-related codes

As discussed above, pregnancy-related codes are among the highly comorbid codes in several constructed networks. We can use the different constructed networks to investigate these comorbidities. Most codes in the 630-679 range are highly connected to one another. Here we focus on comorbidity with codes outside this category. We consider ICD9 code 634 (spontaneous abortion) as an example. In the OER network, the diseases with the highest correlations that tend to precede 634 include 792 (nonspecific abnormal findings in other body substances), 282 (hereditary hemolytic anemias) and 218 (uterine leiomyoma). These links are also deemed significant by the ϕ method. These are in agreement with previous results in the literature (Serjeant et al. 2004; Pajor et al. 1993; Powars et al. 1986; Coronado et al. 2000; Klatsky et al. 2008). The only link that the OER, ϕ, DF, and IPFP methods all agree on is the one from 792. Figure 7 shows the local graph of pregnancy-related codes. As stated above, pregnancy-related codes are highly cohesive and each of them connects to many others. For better visualization, we sparsified the neighborhood networks as follows. For each node, we only retained the two pregnancy-related neighbors with highest weights and the two non-pregnancy codes with highest weights. In Fig. 7, the blue links are between pregnancy codes and the orange links are between pregnancy and non-pregnancy codes. We used the ϕ network to generate this graph. Highly connected nodes in this subnetwork are iron deficiency anemia and other types of anemias (Serjeant et al. 2004; Pajor et al. 1993; Powars et al. 1986) and cholelithiasis (Ács et al. 2009; Dixon et al. 1987).
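The sparsification rule described above can be sketched as follows. This is only an illustration under stated assumptions: link direction is ignored when ranking the neighbors of a node, only pregnancy codes select neighbors, and the final graph is the union of the links selected by each pregnancy code. The function names and the toy weights are hypothetical and are not values from the data.

```python
def is_pregnancy(code):
    """ICD9 codes 630-679 (pregnancy, childbirth, and the puerperium)."""
    return 630 <= code <= 679


def sparsify(weights, k=2):
    """Keep, for every pregnancy code, its k heaviest links to pregnancy
    neighbors and its k heaviest links to non-pregnancy neighbors.

    weights maps a directed link (u, v) to its weight.
    """
    incident = {}
    for (u, v), w in weights.items():
        incident.setdefault(u, []).append((w, v, (u, v)))
        incident.setdefault(v, []).append((w, u, (u, v)))

    kept = set()
    for node, links in incident.items():
        if not is_pregnancy(node):
            continue
        for want_pregnancy in (True, False):
            group = [x for x in links if is_pregnancy(x[1]) == want_pregnancy]
            group.sort(key=lambda x: x[0], reverse=True)  # heaviest first
            kept.update(link for _, _, link in group[:k])
    return {link: weights[link] for link in kept}


# Hypothetical toy weights around code 634.
toy_weights = {
    (634, 640): 5, (634, 641): 3, (634, 642): 1,
    (642, 640): 7, (642, 641): 8,
    (792, 634): 6, (282, 634): 4, (218, 634): 2,
}
backbone = sparsify(toy_weights)
print(sorted(backbone))
```

In the toy example, the link between 634 and 642 is dropped because neither endpoint ranks it among its two heaviest pregnancy links, and the link from 218 is dropped because 634 already has two heavier non-pregnancy neighbors.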

Fig. 7

Local network of pregnancy-related codes. The neighborhood of the pregnancy-related codes 630-679 as deemed significant by the ϕ method. The blue links are between pregnancy codes and the orange links are between pregnancy and non-pregnancy codes. Node sizes are proportional to in-strengths

Example application: insight from coarse-grained networks

We now construct a coarse-grained picture of the disease network based on the standard 17 categories of the ICD9 coding scheme (World Health Organization 2004). The list of the 17 categories is presented in Table 18, along with the number of 3-digit ICD9 codes contained within each category, the percentage of 3-digit ICD9 codes contained within each category, the number of diagnoses in the data set that pertain to diseases within each category, and the percentage of such diagnoses.

Table 18 Partitioning the diseases into 17 categories according to the ICD9 codes to obtain a coarse-grained characterization of the disease networks
Table 19 The percentage of the total link strength of the constructed networks that flows into and out of each of the 17 disease categories in the coarse-grained network
Table 20 The sum of link weight and the percentage of total link weight of the constructed networks that fall within each of the 17 disease categories in the coarse-grained network

The network properties of the 17-node coarse-grained networks are summarized in Table 19, which presents the percentage of the total link weight that flows into and out of each category. Table 20 presents the percentage of the total link weight of the network that falls within each disease category, that is, the weight of links that connect two nodes belonging to the same disease category. In the raw network, the highest self-flow belongs to category 7 (diseases of the circulatory system), with almost 4% of the total link weight contained inside it. In every method except OER and IPFP, a comparatively high fraction of the total link weight of the network flows within category 7. This category includes, most notably, hypertension, cardiovascular disease, and ischemic heart disease.
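A minimal sketch of the coarse-graining step is given below, assuming that the weight of a coarse-grained link is the sum of the weights of all disease-level links between the two categories. The category boundaries are the standard numeric ICD9 chapter ranges (supplementary V and E codes are not handled); the function names and toy weights are hypothetical.

```python
import bisect

# First 3-digit code of each of the 17 numeric ICD9 chapters.
ICD9_CHAPTER_STARTS = [1, 140, 240, 280, 290, 320, 390, 460, 520,
                       580, 630, 680, 710, 740, 760, 780, 800]


def category(code):
    """Map a numeric 3-digit ICD9 code (1-999) to its chapter number (1-17)."""
    return bisect.bisect_right(ICD9_CHAPTER_STARTS, code)


def coarse_grain(weights):
    """Sum disease-level link weights into category-level link weights."""
    agg = {}
    for (u, v), w in weights.items():
        key = (category(u), category(v))
        agg[key] = agg.get(key, 0.0) + w
    return agg


def self_flow_share(agg):
    """Fraction of the total link weight that stays within each category."""
    total = sum(agg.values())
    return {c: w / total for (c, d), w in agg.items() if c == d}


# Hypothetical toy weights: 401 and 410 are circulatory (category 7),
# 250 is endocrine (category 3), 634 and 640 are pregnancy (category 11).
toy = {(401, 410): 3, (250, 401): 1, (634, 640): 4, (410, 250): 2}
coarse = coarse_grain(toy)
print(self_flow_share(coarse))  # {7: 0.3, 11: 0.4}
```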

The second highest self-flow in the raw network belongs to category 11 (complications of pregnancy, childbirth, and the puerperium). The OER network changes some of the self-flows markedly. In the OER network, category 11 has an outstandingly high self-flow. Its self-flow is almost the same as the self-flow of every other category combined. So the OER method assigns a high value to disease pairs that both belong to this category. The IPFP method also assigns an outstandingly high self-flow to this category. So, controlling for prevalence, diagnosis pairs that both belong to category 11 typically receive a high weight in the IPFP network. The Salience method assigns the most outstanding self-flow to category 11. In the Salience network, self-flow of category 11 exceeds the self-flows of every other category combined. This means that these diseases are highly central in the disease-disease trajectories, and the links between these diseases contribute to many shortest paths between every disease pair in the whole network.

In Table 21, we present the rankings of the disease categories in terms of the fraction of the total link weight of the network that flows out of each category. Except for the IPFP network, the highest out-strength in every network belongs either to category 11 (complications of pregnancy, childbirth, and the puerperium) or to category 7 (diseases of the circulatory system). Moreover, category 3 (endocrine/nutritional/metabolic/immunity disorders) is consistently high in out-strength across different networks.

Table 21 Ranking of disease categories in terms of out-strength in the constructed networks

Example application: comorbidity with the neoplasm category

Many studies in the literature have demonstrated comorbidity patterns between different neoplasms and various other diseases (Mazza and Mitchell 2017; van Baal et al. 2011; Piccirillo et al. 2004; Piccirillo 2000; Fleming et al. 1999; West et al. 1996; Yancik et al. 1996). We can use the coarse-grained networks to investigate neoplasm-related comorbidity patterns. Since we are conditioning on a neoplasm being the first diagnosed disease in comorbidity pairs, we can use the DF network. In the coarse-grained DF network, the out-links with the highest weights emanating from the neoplasm category are those to node 7 (diseases of the circulatory system, with 21% of the total link weight of the network), and node 3 (endocrine/nutritional/metabolic/immunity disorders, with 10% of the total link weight). The same is true for the OER and ϕ networks; the top two disease categories that follow neoplasms are 7 and 3.

Conclusion and future work

In this paper we provided a brief summary of some of the main existing methods in the network science literature that could be utilized to construct disease comorbidity networks from longitudinal hospital data. We showed that these methods capture different aspects of the comorbidity patterns, and one must note their properties and choose among them according to the structural feature of interest. We presented several examples of the applications of these methods in studying different comorbidity relations, both for single diseases and disease groups. Methodological work in this domain is inchoate, similar to the field of network medicine itself. So there are many interesting unexplored problems of practical significance. Below we highlight a few such problems.

As discussed above, there are many cases in which a patient visits the hospital and multiple diseases are diagnosed and stored in the data set for the same visit. The temporal direction of such links is lost. We chose to discard such links and refrained from introducing noise to the data set by counting them as bidirectional. An interesting problem would be to infer the direction of these undirected links using edge recovery algorithms (Martin et al. 2016), which can also be done as a byproduct of community-detection algorithms (Martin et al. 2016). This is not to be confused with link prediction, where the task is to predict the existence of empirically absent links. One would first have to devise an inference method that is applicable to weighted directed graphs.

In the construction of the networks discussed in the text, we did not use covariates such as age and gender. The interplay between such covariates and structure is worth serious investigation. Manual investigation, such as separating the data for different sexes (such as in Ref. Chmiel et al. (2014); Jensen et al. (2014)), does provide insight, yet a systematic and algorithmic approach would be an interesting research problem. For example, it would be interesting to formulate the macro/meso properties of the comorbidity network, and the local properties of individual diseases, as a function of age. This would be an example of link metadata: a disease would be connected to another disease via multiple links whose metadata (age/sex) are different. In this case, unlike what we did in the present paper, one must not aggregate all the link weights into a single weight. Though methods that incorporate nodal metadata exist (for example, for community detection (Newman and Clauset 2016; Peel et al. 2017)), we are not aware of a systematic investigation for link-based metadata. A simpler approach for the age variable would be to divide it into discrete categories to construct a multiplex network, where each layer represents an age group. Then methods for multiplex network analysis can be applied (e.g., those of community detection (De Bacco et al. 2017)).
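The multiplex construction suggested above could be sketched as follows. The sketch assumes that each observed link carries a single age value (for instance, the patient's age at the second diagnosis) and that the age-bin boundaries are chosen by the analyst; both are hypothetical choices, as are the function name and the toy records.

```python
import bisect
from collections import defaultdict


def build_age_layers(links, bin_edges):
    """Split directed links into one raw weighted network per age group.

    links is an iterable of (first_disease, second_disease, age) records.
    Layer i holds ages in [bin_edges[i - 1], bin_edges[i]); layer 0 holds
    ages below bin_edges[0] and the last layer holds ages at or above
    bin_edges[-1].
    """
    layers = defaultdict(lambda: defaultdict(int))
    for a, b, age in links:
        layer = bisect.bisect_right(bin_edges, age)
        layers[layer][(a, b)] += 1
    return {layer: dict(net) for layer, net in layers.items()}


# Hypothetical records and age bins (<18, 18-39, 40-64, 65+).
records = [("A", "B", 30), ("A", "B", 35), ("A", "B", 70), ("B", "C", 10)]
layers = build_age_layers(records, bin_edges=[18, 40, 65])
print(layers[1])  # {('A', 'B'): 2}
```

Summing the weight of a link over all layers recovers the aggregated weight used in the single-layer networks, so the layered representation retains strictly more information.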

Perhaps more important would be to investigate multi-morbidity patterns. This would be the first step towards controlling for age. That is, if instead of comorbidity pairs, we limit the analysis to sets of, say, four diseases, the analysis would differ. We would look for the number of instances in which the ordered set of diseases ABCD has been diagnosed in the data set. This probably evinces a more meaningful connection between these diseases, as opposed to only considering pairs separately. In other words, if there are many instances of ABCD, this is probably more informative regarding the linkages of these diseases as compared to observing many AB and BC and CD pairs separately. Considering longer chains in such a manner is more likely to capture actual disease progression pathways, because it reduces the likelihood that, for example, A simply happens to be a disease that disproportionately occurs at early ages, and D at later ages.
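As a rough sketch of this idea, the snippet below counts how many patients exhibit each ordered chain of k diseases. It assumes that a chain means k consecutive diagnoses in a patient's trajectory and that each patient contributes at most once per chain; the trajectories and the function name are hypothetical.

```python
from collections import Counter


def count_chains(trajectories, k):
    """Count the patients whose trajectory contains each ordered run of
    k consecutive diagnoses."""
    counts = Counter()
    for traj in trajectories:
        runs = {tuple(traj[i:i + k]) for i in range(len(traj) - k + 1)}
        counts.update(runs)  # a set, so each chain counts once per patient
    return counts


# Hypothetical trajectories: three patients follow A, B, C, D in order,
# while three others contribute only the isolated pairs AB, BC, and CD.
trajectories = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["X", "A", "B", "C", "D", "Y"],
    ["A", "B"],
    ["B", "C"],
    ["C", "D"],
]
print(count_chains(trajectories, 4)[("A", "B", "C", "D")])  # 3
print(count_chains(trajectories, 2)[("A", "B")])            # 4
```

The pair counts alone cannot distinguish the three full ABCD trajectories from unrelated AB, BC, and CD pairs, which is the loss of information described above.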

Another interesting direction forward would be to incorporate death. One could add death as a new disease. We would obviously have \(s_{\text {death}}^{\text {out}}=0\), but it would be instructive to study the in-links and the position of death in the disease network. More particularly, different diseases can be characterized by their distance to this reference node. The first step towards incorporating death into the disease network would be to check the relative network distance of certain diseases, or disease categories, to the death node, which characterizes how deadly those diseases or disease categories are. This would also enable focusing the analyses on diseases that tend to appear late, characterizing their relation to death, and investigating whether they have special properties in this regard.
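One possible way to compute such distances is sketched below. It assumes that a link of weight w is assigned length 1/w, so that heavier links are shorter, and runs Dijkstra's algorithm on the reversed network starting from the death node. The length convention, the node labels, and the toy weights are assumptions of this sketch, not choices made in this paper.

```python
import heapq


def distance_to_target(weights, target):
    """Shortest weighted distance from every node to target, with link
    length 1 / weight, via Dijkstra's algorithm on the reversed graph."""
    reverse = {}
    for (u, v), w in weights.items():
        if w > 0:
            reverse.setdefault(v, []).append((u, 1.0 / w))

    dist = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for prev, length in reverse.get(node, ()):
            candidate = d + length
            if candidate < dist.get(prev, float("inf")):
                dist[prev] = candidate
                heapq.heappush(heap, (candidate, prev))
    return dist


# Hypothetical toy network; "death" has in-links only.
toy = {("A", "death"): 4, ("B", "A"): 2, ("B", "death"): 1, ("C", "B"): 1}
print(distance_to_target(toy, "death"))
```

In the toy network, B is closer to death through A (0.5 + 0.25) than through its direct but weak link (1.0), which shows how the distance reflects indirect pathways as well as direct ones.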

In addition to death, a more complete analysis would require the addition of a ‘noise’ node, which would represent unknown causes. Currently, when a disease is diagnosed without any predecessor, it receives no in-link. This limits the analytical power to investigate the importance of unknown causes, which characterizes the likelihood that a healthy person would enter the disease comorbidity network. In other words, the inflow of the network (which represents new patients) would enter entirely through this root node, whence it would flow throughout the rest of the network.

Finally, we remark upon the importance of distinguishing between the structural features of acute and chronic diseases in the comorbidity patterns. The temporal profile of comorbidity of disease A, if A is a chronic disease, looks like: {A}→{A,B}→{A,B,C}→…, because chronic diseases tend to persist, by definition. An alternative pattern could be {A}→{A,B}→{A,C}→…. On the contrary, if A is an acute disease, then the temporal pattern of comorbidities would look like {A}→{B}→{C}→…. These patterns presumably undergird different mechanisms of comorbidity, because there is a difference between having a certain disease and having a history of it. Thus a worthwhile problem to study would be the algorithmic characterization of the chronic-ness of diseases and the persistence of their damages, based on their structural properties in the comorbidity network.



Abbreviations

CMA: Census metropolitan area

DF: Disparity filter

GloSS: Global statistical significance

ICD: International classification of diseases

IPFP: Iterative proportional fitting procedure

OER: Observed-to-expected ratio

PPI: Protein-protein interaction

RR: Relative risk

  • Ács, N, Bánhidy F, Puhó EH, Czeizel AE (2009) Possible association between symptomatic cholelithiasis-complicated cholecystitis in pregnant women and congenital abnormalities in their offspring—a population-based case–control study. Eur J Obstet Gynecol Reprod Biol 146(2):152–155.

  • Bonacich, P (1987) Power and centrality: A family of measures. Am J Sociol 92(5):1170–1182.

  • Bouma, G (2009) Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of the Biennial GSCL Conference, 31–40, Potsdam.

  • Brabin, BJ, Hakimi M, Pelletier D (2001) An analysis of anemia and pregnancy-related maternal mortality. J Nutr 131(2):604–615.

  • Brin, S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1–7):107–117.

  • Chakrabarty, D, Khanna S (2018) Better and simpler error analysis of the Sinkhorn-Knopp algorithm for matrix scaling. arXiv preprint arXiv:1801.02790.

  • Chmiel, A, Klimek P, Thurner S (2014) Spreading of diseases through comorbidity networks across life and gender. New J Phys 16(11):115013.

  • Clauset, A, Shalizi CR, Newman ME (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703.

  • Coronado, GD, Marshall LM, Schwartz SM (2000) Complications in pregnancy, labor, and delivery with uterine leiomyomas: a population-based study. Obstet Gynecol 95(5):764–769.

  • Davenport Jr, EC, El-Sanhurry NA (1991) Phi/phimax: review and synthesis. Educ Psychol Meas 51(4):821–828.

  • De Bacco, C, Power EA, Larremore DB, Moore C (2017) Community detection, link prediction, and layer interdependence in multilayer networks. Phys Rev E 95(4):042317.

  • Desai, M, Ter Kuile FO, Nosten F, McGready R, Asamoa K, Brabin B, Newman RD (2007) Epidemiology and burden of malaria in pregnancy. Lancet Infect Dis 7(2):93–104.

  • Dixon, NP, Faddis DM, Silberman H (1987) Aggressive management of cholecystitis during pregnancy. Am J Surg 154(3):292–294.

  • Everitt, BS (1992) The Analysis of Contingency Tables. Chapman and Hall/CRC, London.

  • Fleming, ST, Rastogi A, Dmitrienko A, Johnson KD (1999) A comprehensive prognostic index to predict survival based on multiple comorbidities: a focus on breast cancer. Med Care 37(6):601–614.

  • Folino, F, Pizzuti C, Ventura M (2010) A comorbidity network approach to predict disease risk. In: Information Technology in Bio- and Medical Informatics, ITBAM 2010, 102–109. Springer, Heidelberg.

  • Fortunato, S, Boguñá M, Flammini A, Menczer F (2006) Approximating PageRank from in-degree. In: International Workshop on Algorithms and Models for the Web-Graph, 59–71. Springer.

  • Goh, K-I, Cusick ME, Valle D, Childs B, Vidal M, Barabási A-L (2007) The human disease network. Proc Natl Acad Sci 104(21):8685–8690.

  • Grady, D, Thiemann C, Brockmann D (2012) Robust classification of salient links in complex networks. Nat Commun 3:864.

  • Guilford, JP (1965) The minimal phi coefficient and the maximal phi. Educ Psychol Meas 25(1):3–8.

  • Halu, A, De Domenico M, Arenas A, Sharma A (2017) The multiplex network of human diseases. bioRxiv:100370.

  • Hidalgo, CA, Blumm N, Barabási A-L, Christakis NA (2009) A dynamic network approach for the study of human phenotypes. PLoS Comput Biol 5(4):1000353.

  • James, AH, Bushnell CD, Jamison MG, Myers ER (2005) Incidence and risk factors for stroke in pregnancy and the puerperium. Obstet Gynecol 106(3):509–516.

  • Jensen, AB, Moseley PL, Oprea TI, Ellesøe SG, Eriksson R, Schmock H, Jensen PB, Jensen LJ, Brunak S (2014) Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat Commun 5:4022.

  • Jeong, E, Ko K, Oh S, Han HW (2017) Network-based analysis of diagnosis progression patterns using claims data. Sci Rep 7(1):15561.

  • Katz, D, Baptista J, Azen S, Pike M (1978) Obtaining confidence intervals for the risk ratio in cohort studies. Biometrics 34(3):469–474.

  • Kittner, SJ, Stern BJ, Feeser BR, Hebel JR, Nagey DA, Buchholz DW, Earley CJ, Johnson CJ, Macko RF, Sloan MA, et al. (1996) Pregnancy and the risk of stroke. N Engl J Med 335(11):768–774.

  • Klatsky, PC, Tran ND, Caughey AB, Fujimoto VY (2008) Fibroids and reproductive outcomes: a systematic literature review from conception to delivery. Am J Obstet Gynecol 198(4):357–366.

  • Kleinberg, JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632.

  • Lee, D-S, Park J, Kay K, Christakis N, Oltvai Z, Barabási A-L (2008) The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci 105(29):9880–9885.

  • Liu, YI, Wise PH, Butte AJ (2009) The “etiome”: identification and clustering of human disease etiological factors. BMC Bioinformatics 10(2):14.

  • Ma, L-L, Yu J-T, Wang H-F, Meng X-F, Tan C-C, Wang C, Tan L (2014) Association between cancer and Alzheimer’s disease: systematic review and meta-analysis. J Alzheimers Dis 42(2):565–573.

  • Martin, T, Ball B, Newman ME (2016) Structural inference for uncertain networks. Phys Rev E 93(1):012306.

  • Mazza, D, Mitchell G (2017) Cancer, ageing, multimorbidity and primary care. Eur J Cancer Care 26(3):12717.

  • Menche, J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási A-L (2015) Uncovering disease-disease relationships through the incomplete interactome. Science 347(6224):1257601.

  • Newman, ME, Clauset A (2016) Structure and inference in annotated networks. Nat Commun 7:11863.

  • Pajor, A, Lehoczky D, Szakacs Z (1993) Pregnancy and hereditary spherocytosis. Arch Gynecol Obstet 253(1):37–42.

  • Park, J, Lee D-S, Christakis NA, Barabási A-L (2009) The impact of cellular networks on disease comorbidity. Mol Syst Biol 5(1):262.

  • Peel, L, Larremore DB, Clauset A (2017) The ground truth about metadata and community detection in networks. Sci Adv 3(5):1602548.

  • Piccirillo, JF (2000) Importance of comorbidity in head and neck cancer. Laryngoscope 110(4):593–602.

  • Piccirillo, JF, Tierney RM, Costas I, Grove L, Spitznagel Jr EL (2004) Prognostic importance of comorbidity in a hospital-based cancer registry. JAMA 291(20):2441–2447.

  • Powars, DR, Sandhu M, Niland-Weiss J, Johnson C, Bruce S, Manning PR (1986) Pregnancy in sickle cell disease. Obstet Gynecol 67(2):217–228.

  • Radicchi, F, Ramasco JJ, Fortunato S (2011) Information filtering in complex weighted networks. Phys Rev E 83(4):046101.

  • Schütze, H, Manning CD, Raghavan P (2008) Introduction to Information Retrieval vol. 39. Cambridge University Press, UK.

  • Serjeant, GR, Loy LL, Crowther M, Hambleton IR, Thame M (2004) Outcome of pregnancy in homozygous sickle cell disease. Obstet Gynecol 103(6):1278–1285.

  • Serrano, MÁ, Boguñá M, Vespignani A (2009) Extracting the multiscale backbone of complex weighted networks. Proc Natl Acad Sci 106(16):6483–6488.

  • Sinkhorn, R, Knopp P (1967) Concerning nonnegative matrices and doubly stochastic matrices. Pac J Math 21(2):343–348.

  • Slater, PB (2009a) A two-stage algorithm for extracting the multiscale backbone of complex weighted networks. Proc Natl Acad Sci 106(26):66–66.

  • Slater, PB (2009b) Multiscale network reduction methodologies: Bistochastic and disparity filtering of human migration flows between 3,000+ US counties. arXiv preprint arXiv:0907.2393.

  • Tabarés-Seisdedos, R, Dumont N, Baudot A, Valderas JM, Climent J, Valencia A, Crespo-Facorro B, Vieta E, Gómez-Beneyto M, Martínez S, et al. (2011) No paradox, no progress: inverse cancer comorbidity in people with other complex diseases. Lancet Oncol 12(6):604–608.

  • van Baal, PH, Engelfriet PM, Boshuizen HC, van de Kassteele J, Schellevis FG, Hoogenveen RT (2011) Co-occurrence of diabetes, myocardial infarction, stroke, and cancer: quantifying age patterns in the Dutch population using health survey data. Popul Health Metrics 9(1):51.

  • West, DW, Satariano WA, Ragland DR, Hiatt RA (1996) Comorbidity and breast cancer survival: a comparison between black and white women. Ann Epidemiol 6(5):413–419.

  • World Health Organization (2004) International statistical classification of diseases and related health problems.

  • Wu, Z, Braunstein LA, Havlin S, Stanley HE (2006) Transport in weighted networks: partition into superhighways and roads. Phys Rev Lett 96(14):148702.

  • Xing, W, Ghorbani A (2004) Weighted PageRank algorithm. In: Proceedings of the Second Annual Conference on Communication Networks and Services Research, CNSR 2004, 305–314. IEEE, Fredericton, N.B.

  • Yancik, R, Havlik RJ, Wesley MN, Ries L, Long S, Rossi WK, Edwards BK (1996) Cancer and comorbidity in older patients: a descriptive profile. Ann Epidemiol 6(5):399–412.

  • Yıldırım, MA, Goh K-I, Cusick ME, Barabási A-L, Vidal M (2007) Drug—target network. Nat Biotechnol 25(10):1119–1126.

  • Zhou, X, Menche J, Barabási A-L, Sharma A (2014) Human symptoms–disease network. Nat Commun 5:4212.


This work was funded in part by the James S. McDonnell Foundation (BF, NM, and MAR).

Availability of data and materials

In order to access the data, we were required to obtain approval from our institutional review board and from the Commission d’accès à l’information du Québec. A condition of our access was that we are not permitted to share the data with other researchers. However, researchers can request access to the same data directly from the data holder, Régie de l’assurance maladie du Québec (RAMQ), by following the process outlined on their website: donnees-demande.aspx

Author information

Authors and Affiliations



All authors contributed to every aspect of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Babak Fotouhi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Fotouhi, B., Momeni, N., Riolo, M. et al. Statistical methods for constructing disease comorbidity networks from longitudinal inpatient data. Appl Netw Sci 3, 46 (2018).
