Statistical methods for constructing disease comorbidity networks from longitudinal inpatient data

Fotouhi, Babak; Momeni, Naghmeh; Riolo, Maria A.; Buckeridge, David L.

doi:10.1007/s41109-018-0101-4

Research
Open access
Published: 07 November 2018

Applied Network Science volume 3, Article number: 46 (2018)

Abstract

Tools from network science can be used to study relations between diseases. Different studies focus on different types of inter-disease linkages. One such type is the comorbidity patterns derived from large-scale longitudinal data of hospital discharge records. Researchers seek to describe comorbidity relations as a network in order to characterize pathways of disease progression and to predict future risks. The first step in such studies is the construction of the network itself, on which all subsequent analyses rest. There are different ways to build such a network. In this paper, we provide an overview of several existing statistical approaches in network science that are applicable to weighted directed networks. We discuss the differences between the null models that these methods assume and their applications. We apply these methods to the inpatient data of approximately one million people, spanning approximately 17 years, pertaining to the Montreal Census Metropolitan Area. We discuss the differences in the structure of the networks built by different methods, and the different features of the comorbidity relations that they extract. We also present several example applications of these methods.

Introduction

In the last decade, several network approaches have been introduced to study the interrelations between human diseases. Networks are constructed by connecting diseases that share certain features, collapsing a bipartite graph into a unipartite graph. Examples include genetic/interactomic association (Goh et al. 2007; Halu et al. 2017; Menche et al. 2015), similarity of symptoms (Zhou et al. 2014; Halu et al. 2017), similarity of pertinent drugs (Yıldırım et al. 2007), commonality of etiological environmental factors associated with diseases (Liu et al. 2009), adjacency of metabolic reactions catalyzed by corresponding mutated enzymes (Lee et al. 2008), and co-occurrence in patients (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jensen et al. 2014; Jeong et al. 2017). Sometimes more than one of these networks is juxtaposed to build a multiplex characterization (Halu et al. 2017). All of these strands of research are informative in their respective contexts, and the increase in the breadth of topics and the diversity of approaches promises the emergence of a new field of research.

Here we focus on a methodological problem in this new field. We investigate different statistical methods for defining a weighted and directed comorbidity network from longitudinal hospital inpatient data, and show that different methods capture different aspects of comorbidity relations. We use a data set containing approximately one million people over a period of approximately 17 years, and employ different statistical methods to extract comorbidity networks from this data set.

Some previous studies have used a binary version of the comorbidity network to study the structural properties of diseases (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jeong et al. 2017). Measures for establishing unweighted binary links between disease pairs include the ϕ-correlation (which is closely linked to the χ² statistic) and the relative risk (the ratio of the observed co-occurrence of a pair to the co-occurrence expected under a null model) (Hidalgo et al. 2009; Folino et al. 2010; Chmiel et al. 2014; Jeong et al. 2017). These methods capture useful information about comorbidities, but they also have drawbacks. The ϕ-correlation underestimates the association in disease pairs in which one disease is rare and the other is prevalent. The relative risk tends to overestimate linkages between rare diseases and to underestimate those between prevalent diseases. To use any of these methods, one inevitably chooses trade-off parameters to construct the network with reasonable accuracy. Examples include the thresholds in Chmiel et al. (2014), the choice of relative risk cutoff (4 in Jeong et al. (2017) and 20 in Folino et al. (2010)), and the definition of a reciprocal link as “lop-sided” if one direction weighs at least twice as much as the other (Jeong et al. 2017). These thresholds are chosen as intuitively reasonable values for the respective settings.
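The biases of the two measures can be illustrated with a small numerical sketch. It assumes the definitions commonly used in this literature (for example, Hidalgo et al. 2009): for N patients, prevalences P_i and P_j, and C_ij co-occurrences, the relative risk is C_ij N / (P_i P_j) and the ϕ-correlation is (C_ij N − P_i P_j) / sqrt(P_i P_j (N − P_i)(N − P_j)). The function names and all counts below are hypothetical and are not taken from our data set.

```python
import math


def relative_risk(c_ij, p_i, p_j, n):
    """Observed co-occurrence divided by the co-occurrence expected
    if the two diseases were independent (assumed null model)."""
    return c_ij * n / (p_i * p_j)


def phi_correlation(c_ij, p_i, p_j, n):
    """Pearson correlation of the two binary disease indicators."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


n = 1_000_000  # hypothetical population size

# Two rare diseases: five shared patients already give a very large relative risk.
rare_pair = dict(c_ij=5, p_i=10, p_j=10, n=n)

# Two prevalent diseases: a sizeable excess of co-occurrence gives a modest relative risk.
prevalent_pair = dict(c_ij=150_000, p_i=300_000, p_j=400_000, n=n)

# One rare and one prevalent disease: every patient with the rare disease
# also has the prevalent one, yet the phi-correlation stays small.
mixed_pair = dict(c_ij=100, p_i=100, p_j=500_000, n=n)

for name, pair in [("rare", rare_pair), ("prevalent", prevalent_pair), ("mixed", mixed_pair)]:
    print(name, relative_risk(**pair), round(phi_correlation(**pair), 4))
```

With these made-up counts, the rare pair has a relative risk in the tens of thousands while the prevalent pair stays close to 1, and the mixed pair has a ϕ-correlation of about 0.01 despite complete overlap, which is why fixed cutoffs on either measure treat rare and prevalent diseases unevenly.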

In this paper, we study different systematic statistical methods for building weighted directed comorbidity networks. These methods use different criteria to assess the statistical significance of links. The resulting networks are sparser than the raw network, and the retained links are in some sense adjudicated as meaningful, that is, non-noise. In addition to statistical considerations, working with sparser networks is easier both computationally and intuitively, and it facilitates the ultimate goal of gaining insight into paths of disease progression. Here we investigate how the statistical procedure used to build a network from disease co-occurrence data affects the structure of the resulting network. We show that, depending on the null model used to define the statistical significance of disease-disease links, different aspects of the comorbidity patterns are captured, the resulting networks can have different micro/meso structures, and the centrality/ranking measures of individual diseases can differ. We describe the networks built by each method, discuss their similarities and differences, and present several example applications using these constructed networks.

Data

Using the registry of all medically insured people in the province of Québec (fichier d’inscription des personnes assurées - FIPA) we randomly sampled 25% of the people residing in the Montreal Census Metropolitan Area (CMA) in 1998. In each subsequent year, we used the FIPA to re-sample immigrants to the CMA and babies born to mothers residing in the CMA to maintain a representative, 25% sample for each year. For sampled individuals, we obtain regular data updates from the Régie de l’assurance maladie du Québec (RAMQ) on physician billing, drugs dispensed, hospitalization records, and death certificates. The data sets are linked with an anonymized unique identifier. At any given time, the dynamic cohort contains approximately 1 million people and follow-up data span approximately 17 years.

Moreover, in one of the applications that we present below, we use the dataset that is publicly available via Ref. Park et al. (2009) to connect our results to previous findings in the literature. In this data set, the protein–protein interaction (PPI) and coexpression networks and the inter-disease network of shared genes are linked to the comorbidity network derived from US Medicare claims of over 13 million elderly patients. The data set can be accessed online via http://msb.embopress.org/content/5/1/262.

The analyses reported in this paper have been conducted using MATLAB R2015b.

Network construction methods

ICD codes

We use the ICD9 coding scheme for the classification of diseases. To make the analysis more tractable, we confine the analysis to the 3-digit classification.

Network terminology

Throughout, the pathways of disease progression are modeled by a network, where nodes represent diseases and a link from node i to node j represents an instance of diagnosis of disease i followed by a subsequent diagnosis of disease j. The number of connections of a node is its degree, denoted by k. The weight of the link from disease i to disease j is denoted by w_ij, which is equal to the number of times a diagnosis of disease i followed by a diagnosis of disease j is reported in the data set. By the strength (Serrano et al. 2009) of a node, denoted by s, we refer to the sum of the weights of its links. We use these quantities for either direction of the links. For example, the ‘out-strength’ $s_{x}^{\text {out}}={\sum \nolimits }_{y} w_{xy}$ denotes the sum of the weights of the out-links of node x to other nodes, and the out-degree $k_{x}^{\text {out}}$ denotes the number of such out-links. The out-strength of a node is equal to the total number of times that the diagnosis of that disease was followed by the diagnosis of any other disease. The out-degree of a node is the number of distinct diseases that follow that particular disease, without counting the multiplicities. Similarly, we can define the in-strength and in-degree for each node. We denote the sum of the weights of all links by S, that is, we have $S={\sum \nolimits }_{ij} w_{ij}$.
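The following minimal sketch shows how these quantities follow from the link weights. It assumes the network is stored as a dictionary mapping ordered pairs (i, j) to w_ij; the function name `node_statistics` and the example weights are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict


def node_statistics(weights):
    """Compute in/out strengths, in/out degrees and the total weight S.

    `weights` maps an ordered pair (i, j) to the link weight w_ij.
    Links with non-positive weight are treated as absent.
    """
    s_out, s_in = defaultdict(int), defaultdict(int)
    k_out, k_in = defaultdict(int), defaultdict(int)
    total = 0
    for (i, j), w in weights.items():
        if w <= 0:
            continue
        s_out[i] += w   # out-strength: sum of weights of out-links of i
        s_in[j] += w    # in-strength: sum of weights of in-links of j
        k_out[i] += 1   # out-degree: number of distinct successors of i
        k_in[j] += 1    # in-degree: number of distinct predecessors of j
        total += w      # S: sum of all link weights
    return s_out, s_in, k_out, k_in, total


# Hypothetical weights between three 3-digit ICD9 categories.
example_weights = {
    ("250", "401"): 12,
    ("250", "428"): 3,
    ("401", "428"): 7,
    ("428", "250"): 1,
}
s_out, s_in, k_out, k_in, S = node_statistics(example_weights)
print(s_out["250"], k_out["250"], s_in["428"], k_in["428"], S)
```

In this toy example, node 250 has out-strength 15 spread over 2 out-links, which shows how strength counts multiplicities whereas degree does not.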

Raw network

In our data set, there are 1,700,000 distinct hospital visits, and the total number of unique ICD9-coded diagnoses is 6,500,000. Among all the hospital visits, 35.3% were given only one ICD9-coded diagnosis. Figure 1 presents the histogram of the number of ICD9-coded diagnoses per hospital visit. Table 1 presents the 10 diseases with the highest prevalence in the data set. Figure 2 depicts the histogram of the prevalence of the diseases in our data set. The distribution of the logarithm of the prevalences appears approximately normal, but a Kolmogorov-Smirnov test rejects the normality assumption (at the 0.1 level). Though not strictly log-normal, the prevalence distribution is evidently heavy-tailed, that is, most diseases have low prevalence and a minority of diseases have extremely high prevalence. The starting point of our analysis is to build a raw network, which will be the substrate on which other methods construct different derived networks.
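As an illustration only, a raw weighted directed network can be sketched as follows. The sketch makes two assumptions that are not stated in the text above: codes are truncated to their 3-digit category by dropping everything after the decimal point, and a diagnosis of i counts as "followed by" j whenever i appears at one visit of a patient and j at that patient's next visit. The names `three_digit`, `raw_network` and the patient histories are hypothetical.

```python
from collections import Counter
from itertools import product


def three_digit(code):
    """Truncate an ICD9 code such as '250.01' to its category '250'
    (assumption: the category is the part before the decimal point)."""
    return code.split(".")[0]


def raw_network(histories):
    """Count directed pairs (i, j) over consecutive visits of each patient.

    `histories` maps a patient identifier to a chronologically ordered
    list of visits, each visit being a list of ICD9 codes.
    """
    weights = Counter()
    for visits in histories.values():
        categories = [{three_digit(code) for code in visit} for visit in visits]
        # Pair each visit with the next visit of the same patient.
        for earlier, later in zip(categories, categories[1:]):
            for i, j in product(earlier, later):
                if i != j:  # ignore self-loops in this sketch
                    weights[(i, j)] += 1
    return weights


# Hypothetical patient histories (not taken from the data set).
histories = {
    "patient_1": [["250.01", "401.9"], ["428.0"]],
    "patient_2": [["250.00"], ["428.0"], ["250.02"]],
}
w = raw_network(histories)
print(dict(w))
```

Other conventions are possible, for example counting all later visits rather than only the next one, and the choice affects the link weights on which every derived network is built.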

Table 1 Top 10 most-prevalent diseases in our data set