Predicting onset of complications from diabetes: a graph based approach.

Diabetes is a significant health concern, with more than 30 million Americans living with the disease. Onset of diabetes increases the risk for various complications, including kidney disease, myocardial infarctions, heart failure, stroke, retinopathy, and liver disease. In this paper, we study and predict the onset of these complications using a network-based approach by identifying fast and slow progressors. That is, given a patient's diagnosis of diabetes, we predict the likelihood of developing one or more of the possible complications, and which patients will develop complications quickly. This combination of "if a complication will be developed" with "how fast it will be developed" can aid the physician in developing a better diabetes management program for a given patient.


Introduction
Diabetes is a significant public health concern in the United States. According to the Centers for Disease Control and Prevention (CDC), an estimated 30.3 million people had diabetes in 2015, with 23.1 million cases diagnosed and 7.2 million undiagnosed (Centers for Disease Control and Prevention 2017). Of those cases, 90 to 95 percent are Type 2 (Centers for Disease Control and Prevention 2017), which is the group that we focus on throughout this paper. Complications (co-morbidities) related to Type 2 Diabetes Mellitus (T2DM) are the key drivers of the health impact and cost of this chronic disease. The vast majority of diabetics will experience a complication from their disease (Nickerson and Dutta 2012). Recent data show that there were 7.2 million hospital discharges reported for people with diabetes in 2014 (Centers for Disease Control and Prevention 2017). Further, diabetes was ranked as the seventh leading cause of death in the United States in 2015, with the total direct and indirect cost of diagnosed diabetes in 2012 at 245 billion dollars (Centers for Disease Control and Prevention 2017). It is critical not only to diagnose the onset of diabetes but also to predict the onset of complications (co-morbidities), which would better assist in long-term care management and improve the health and wellness of patients.
To achieve the objective of predictability of onset of complications, we first represent a patient's disease history as a network based on what happens in the second year after a diabetes diagnosis. Genetic determinants and other independent accelerating factors of the complications of diabetes (Brownlee 2005) clearly establish the basis for these comorbid conditions developing over time. Furthermore, we label patients as either slow or fast progressors in developing complications arising from diabetes, thus developing sub-networks of disease evolution.
The proposed network developed in this study will not only provide a useful modeling construct but also a mechanism for visualizing disease complications. The use of networks to understand disease progression has been studied before, such as in Alzheimer's (Wilkosz et al. 2010) and heart failure (Nagrecha et al. 2017). However, the novelty of our approach lies in the consideration of a heterogeneous network that includes nodes for disease diagnoses, tests, demographics, etc. Through the proposed network-based approach, physicians will be able to leverage the combined experiences of other diabetics to determine how their patients' disease will progress. Pinpointing the risks of complication is of utmost importance for recognizing possible interventions in treatments that have the potential to delay or stop further progression.
We use a large data set of Type 2 Diabetes patients in Indiana, collected over 20 years and obtained through the Regenstrief Institute. This data set includes both diagnosis codes taken from the International Statistical Classification of Diseases and Related Health Problems, Ninth Revision and Tenth Revision (ICD-9 and ICD-10, respectively), and clinical laboratory test results. Researchers have had success using ICD codes to predict future disease states (Davis et al. 2010). We create networks of shared patient experiences using the sub-networks of patients and then identify common groupings of disease that have the greatest propensity of developing diabetic complications. Using both diagnoses and lab results as the nodes and edges in our network, we identify those results that are most predictive of diabetic complications, thereby creating a multiplex or heterogeneous network (Kivelä et al. 2014). This analysis allows us to answer the question: which patients are most at risk for developing what complications? We group patients into two categories, fast or slow progressors, based on whether they develop complications more quickly or more slowly than 25 percent of the population, respectively. By categorizing patients in this way, a more efficient intervention mechanism can be developed. It also allows us to study, as future work, why certain patients are fast or slow progressors, leading to personalized interventions and treatments and improved patient outcomes.

Methods
Predicting diabetic complications is incredibly challenging due to the inequality of healthcare consumption and the speed at which patients receive diagnoses. In our work, we posit that by establishing appropriate thresholds and choosing balanced populations, we can ensure that even patients who infrequently visit their physician can still benefit from our models.

Data description
The Regenstrief Institute created one of the earliest electronic medical record systems in 1972 to support research and continues to handle the research use of the INPC (Indiana Network for Patient Care) database (JM Overhage and McDonald 1995). With the creation of the Indiana Health Information Exchange (IHIE) in 2004 to handle the exchange of data between Indiana's major healthcare provider systems, the availability of data for Indiana patients within INPC has greatly increased, providing a key resource to drive research using "real world environment" observations and data.
In a collaboration of Indiana Biosciences Research Institute, Regenstrief Institute, and industrial partners, a primary data set of type 2 diabetes mellitus (T2DM) patients was created. The inclusion criteria were one T2DM diagnosis code OR a laboratory glycated hemoglobin (HbA1C) test result ≥ 6.5% OR at least one Medi-Span-defined anti-diabetes medication, where the patient was ≥ 18 years of age on the date of the first inclusion criterion. Using these criteria, a primary T2DM cohort of 805,867 individuals was identified from INPC over 20 years. The demographics, diagnosis codes, medical procedures, prescriptions, and results from clinical laboratory tests were extracted for these individuals (Schleyer 2016). This extraction resulted in over 500 million records that were available for analysis. This T2DM data set was then extensively cleaned and normalized to prepare for the analyses as per the diagram in Fig. 1.
To clean this T2DM data set, the extracted INPC data was placed on a secure Amazon Web Services (AWS) server. This large T2DM data set spanning 20 years was multi-modal, with many missing parameters across the records as well as inconsistent measurements, which were identified by error codes, per-patient longitudinal analysis, or out-of-range values. In addition, we had to account for corrections of features that were reported for quality control (QC) checks. To that end, we implemented a comprehensive data cleaning framework in PySpark to normalize the features, remove bad or missing values, and enforce consistent units of measure. The feature values were normalized, and extreme values were identified and filtered using the minimum and maximum values ever measured for a parameter. Additionally, any values more than 2 standard deviations from the median were filtered. We also looked for multiple distribution patterns in the data, where potentially two different units of measure were applied to the same variable, which could indicate poor previous data integration. After this extensive effort to clean the issues from this "real-world" data set captured from INPC, an "analysis-ready" data set was created for the modeling. An overview of the size of the different data tables is given in Table 1.
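The two value filters described above can be illustrated with a minimal sketch. It is written in plain Python rather than PySpark so that it stays self-contained; the function name `filter_outliers`, its parameters, and the plausibility bounds used in the example are hypothetical and not taken from the authors' framework.

```python
import statistics

def filter_outliers(values, hard_min, hard_max, k=2.0):
    """Drop missing and out-of-range values, then values far from the median.

    hard_min / hard_max stand in for the minimum and maximum values ever
    measured for a parameter (assumed to come from a clinical dictionary).
    k is the number of standard deviations allowed around the median.
    """
    # Step 1: remove missing entries and values outside the plausible range.
    in_range = [v for v in values if v is not None and hard_min <= v <= hard_max]
    if len(in_range) < 2:
        return in_range  # too few points to estimate a standard deviation
    # Step 2: remove values more than k standard deviations from the median.
    median = statistics.median(in_range)
    sd = statistics.stdev(in_range)
    return [v for v in in_range if abs(v - median) <= k * sd]

# Hypothetical HbA1C readings (percent) with a missing value and two outliers.
readings = [6.0] * 10 + [12.0, None, 95.0]
cleaned = filter_outliers(readings, hard_min=3.0, hard_max=20.0)
```

In this example the value 95.0 fails the range check, and 12.0 is removed by the standard-deviation rule, leaving the ten readings of 6.0.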
We use the following to categorize primary T2DM diagnoses and complications:
• Heart failure - defined as ICD9/ICD10 codes 428, I50

Fig. 1 Above is a flowchart depicting the cleaning and standardization process which (1) combines and QCs the raw data files, (2) combines variables, standardizes, and cleans using a dictionary specific to the data source, and finally (3) removes variable outliers and normalizes using a universal clinical parameter dictionary

We further sample to create the following data about patients: patient diagnosis, which contains all the diagnosis codes (ICD-9/ICD-10) received by a patient; demographics, which contains age, gender, and race/ethnicity information; and clinical variables, which contains metabolic measurements taken while at the doctor's office. The header of the diagnosis table is given in Table 2, that of the patient data in Table 3, and that of the clinical variables in Table 4. The number of patients who were diagnosed with each complication is given in Table 5.

Building disease diagnoses graphs
We detail the network construction in Algorithm 1, and network pruning in Algorithm 2. We retain a listing of the edges and nodes that represent the fast paths to diabetic complications, along with the nodes that result in the largest information gain.
Algorithm 1 For each patient we go through their disease history and add nodes and edges connecting information regarding measures of patient health. Each node and edge will have an attribute which corresponds to how many patients belong to that node and edge. Every i and j correspond to an entry in that patient's health record.

procedure CREATING THE NETWORK
    N ← empty network
    for p in all patient networks do
        for i in patient disease history do
            for j in patient disease history after i do
                if i in N then:

Algorithm 2 After the network is generated, we test to see which edges pass the two-sided Z-test by comparing how many patients are fast and slow progressors on each edge. If the edge's Z-score's absolute value is not above 1.96, it is pruned from the graph. After pruning, the Z-score is added as an attribute to the nodes and edges remaining in the graph.

procedure PRUNING THE NETWORK
    N ← empty network
    for p in all patient networks do:
        if p is a fast progressor then
            for all nodes n and edges e in that patient network: do
                n_fast_progressor = n_fast_progressor + 1 for node n_i in p
                e_fast_progressor = e_fast_progressor + 1 for edge e_i in p
        else
            for all nodes n and edges e in that patient network: do
                n_slow_progressor = n_slow_progressor + 1 for node n_i in p
                e_slow_progressor = e_slow_progressor + 1 for edge e_i in p
    for n nodes in N do:
        n_z = Z-test(n_fast_progressor, n_slow_progressor)

There are three primary data sources that we use to build our models: patient demographic data, which remains constant throughout the duration of the study and is represented by nodes at the beginning of the network at time zero; patient diagnosis, which contains all the diagnoses that occur over the course of a patient's visit with a doctor or healthcare provider; and clinical variables, which contain all the available measurements and laboratory tests available in the patient's health records as contained in INPC.
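The network construction of Algorithm 1 can be sketched in a few lines of Python. The listing of Algorithm 1 does not show how the counts are updated, so the counting rule here is an assumption: each patient contributes at most one count to each node and to each ordered pair (i, j) where j occurs after i in the history. The function name `build_network` and the string labels in the example are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_network(histories):
    """Count how many patients pass through each node and each edge.

    histories: one chronologically ordered list of entries per patient
    (demographics first, then diagnoses and lab results).
    Returns (node_counts, edge_counts), with edges keyed by (earlier, later).
    """
    node_counts = Counter()
    edge_counts = Counter()
    for history in histories:
        # Keep the first occurrence of each entry so that a patient is
        # counted once per node and once per edge (assumption).
        ordered = list(dict.fromkeys(history))
        node_counts.update(ordered)
        # combinations() preserves input order, so every pair is (i, j)
        # with j appearing after i in the patient's history.
        edge_counts.update(combinations(ordered, 2))
    return node_counts, edge_counts

# Two hypothetical patients; "age:3.0" is a quartiled demographic node.
nodes, edges = build_network([
    ["age:3.0", "250", "428"],
    ["age:3.0", "428", "428"],
])
```

Keeping the counts in two `Counter` objects keeps the sketch free of dependencies; a graph library could hold the same attributes on its nodes and edges.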
We tested the following clinical variables and grouped them into quartiles, which were included in the clinical variables file: non-high-density lipoprotein cholesterol (Non-HDL C), low-density lipoprotein (LDL) to high-density lipoprotein (HDL) ratio, thyroid-stimulating hormone (TSH), fibrosis-4 (Fib 4) index, total cholesterol, low-density lipoprotein cholesterol (LDL C), high-density lipoprotein cholesterol (HDL C), cholesterol ratio, total bilirubin, basophil platelet count (PC), monocyte count, aspartate transaminase to platelet ratio index (APRI), neutrophil count, albumin, alkaline phosphatase (ALP), aspartate transaminase (AST) to alanine transaminase (ALT) ratio, eosinophil PC, protein, HbA1C, ALT, estimated glomerular filtration rate (eGFR), AST, lymphocyte PC, calcium, red blood PC, platelet count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), glucose, blood urea nitrogen (BUN), chloride, creatinine, and carbon dioxide (CO2). Additionally included in the clinical variables file were the following variables, preprocessed into normal and abnormal statuses: weight classification, HDL C, high serum creatinine, high urine glucose, hyperglycemia, hypertension, hypertriglyceridemia, impaired fasting glycemia (IFG), impaired glucose tolerance (IGT), LDL C, and triglycerides. Finally, we also quartile the age of the patients so that we have large groups to test on. Then every piece of information in a patient history is linked to all other nodes, thus creating a heterogeneous network. An example of the network is given in Fig. 2.

Diagnoses can appear in subsequent visits. Day 0 is the day that a type 2 diabetes diagnosis was received
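As an illustration of the quartile grouping, the sketch below assigns each measurement a label from 0.0 to 3.0, following the labeling seen in the Fig. 2 caption, where 1.0 denotes the 25-50 percent range. The helper name `quartile_labels` and the choice of `statistics.quantiles` for the cut points are assumptions, not the authors' implementation.

```python
import bisect
import statistics

def quartile_labels(values):
    """Map each value to a quartile label 0.0, 1.0, 2.0, or 3.0.

    The three cut points are estimated from the values themselves, so the
    labels are relative to the patient population passed in.
    """
    cuts = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    # bisect_right counts how many cut points lie at or below the value.
    return [float(bisect.bisect_right(cuts, v)) for v in values]

# Hypothetical ages 1..100, one per patient.
labels = quartile_labels(list(range(1, 101)))
```

A node label such as "3.0 Age" would then be formed by pairing the quartile label with the variable name.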
After building the network, we prune it by discarding any edges that do not contain statistically significant differences between the fast and slow progressors as defined by using a two-proportion Z test score.
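A minimal sketch of the pruning step follows. The text does not spell out which two proportions are compared, so this sketch assumes they are the share of fast progressors and the share of slow progressors whose histories contain a given edge, combined with the standard pooled two-proportion Z statistic. The function names and the example counts are hypothetical.

```python
import math

def two_proportion_z(x_fast, n_fast, x_slow, n_slow):
    """Pooled two-proportion Z statistic.

    x_fast / x_slow: fast / slow progressors whose history contains the edge.
    n_fast / n_slow: total fast / slow progressors in the training set.
    """
    p_fast = x_fast / n_fast
    p_slow = x_slow / n_slow
    pooled = (x_fast + x_slow) / (n_fast + n_slow)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_fast + 1 / n_slow))
    if se == 0:
        return 0.0  # edge seen in nobody or in everybody: no evidence
    return (p_fast - p_slow) / se

def prune_edges(edge_counts, n_fast, n_slow, threshold=1.96):
    """Keep edges whose |Z| exceeds the threshold; store Z as the attribute."""
    kept = {}
    for edge, (x_fast, x_slow) in edge_counts.items():
        z = two_proportion_z(x_fast, n_fast, x_slow, n_slow)
        if abs(z) > threshold:
            kept[edge] = z
    return kept

# Hypothetical counts for two edges with 100 patients in each group.
counts = {("401", "428"): (60, 40), ("250", "272"): (50, 48)}
kept = prune_edges(counts, n_fast=100, n_slow=100)
```

With these counts the first edge has a Z-score of about 2.83 and is retained, while the second, at about 0.28, is pruned. The same function can be applied to node counts.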
To determine if a patient is a slow or fast progressor, the nodes and edges of the sub-network that match the patient's medical history are traversed and their individual probability of developing a complication is computed. We assume that the node and edge weights, corresponding to the percentages of patients who suffer from that complication that are contained by that node or edge, are equally likely and statistically independent. These weights are multiplied together to get the probability of being a fast progressor. To decrease noise, we determined experimentally to use only the weights, or percent likelihood of developing the specific complication of diabetes, corresponding to the top 12 most significant edges and nodes, as determined by the two-proportion Z-test. In other words, for each individual patient, we only used the most significant parts of their individual network to predict whether or not that patient was a fast or slow progressor. The average AUC values from each of these experiments are shown in Table 6 and Fig. 3. The weight that corresponds to the lowest probability of developing complications is removed, since it was observed that removing this weight boosts the signal of the nodes and edges that result in fast progression of disease. The pruning process can be seen by referring to Fig. 2. The method to compute the probability that an individual will be a fast or slow progressor is as follows. Let $w_0, \ldots, w_n$ correspond to the $n$ most significant edge and node weights as determined by the two-proportion Z-test, where $n \leq 12$. Remove $w_h$ from the computation, which corresponds to the lowest probability of developing the complication. Over the remaining weights, let $p_t = \prod_{i=0}^{n} w_i$ and $p_f = \prod_{i=0}^{n} (1 - w_i)$. Then, the probability that a particular patient is a fast (rather than slow) progressor is $\frac{p_t}{p_t + p_f}$.
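The scoring rule above can be sketched as follows. The input format (a list of Z-score and weight pairs for the nodes and edges matching one patient), the function name, and the handling of degenerate cases are assumptions made for illustration.

```python
import math

def fast_progressor_probability(scored_weights, top_n=12):
    """Combine the most significant weights into one probability.

    scored_weights: (z_score, weight) pairs, where weight is the fraction
    of patients on that node or edge who are fast progressors.
    """
    # Keep the top_n entries by absolute Z-score.
    top = sorted(scored_weights, key=lambda zw: abs(zw[0]), reverse=True)[:top_n]
    weights = [w for _, w in top]
    # Drop the single lowest weight; skipping this when only one weight
    # is available is an assumption of this sketch.
    if len(weights) > 1:
        weights.remove(min(weights))
    p_t = math.prod(weights)                 # all evidence points to "fast"
    p_f = math.prod(1 - w for w in weights)  # all evidence points to "slow"
    if p_t + p_f == 0:
        return 0.5  # contradictory 0 and 1 weights; neutral value (assumption)
    return p_t / (p_t + p_f)

# Hypothetical patient with three significant nodes or edges.
p = fast_progressor_probability([(3.0, 0.8), (2.5, 0.6), (2.0, 0.3)])
```

Here the lowest weight, 0.3, is dropped, giving p_t = 0.8 × 0.6 = 0.48 and p_f = 0.2 × 0.4 = 0.08, so the probability is 0.48 / 0.56, or about 0.857.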

Data cleaning
Only information in patient history that occurred in the second year following a Type 2 diabetes diagnosis is considered. Healthy patients survive longer than sickly ones, so if we extend our analysis for too long after a diabetes diagnosis, the data will become biased towards healthy patients. Patients tend to move and change doctors, and analyzing what occurs in the second year after the diagnosis will ensure that many patients are still in the system. We can see in Fig. 4 that many complications of diabetes occur early, so it is acceptable to limit our analysis to that year. Our "fast progressors" all develop complications within two years of a diabetes diagnosis. Only the second year is important to us. We do not consider what occurs in the first year after diagnosis because we want to introduce more stability into our data and to exclude patients who might be in an emergency-room-type situation when diagnosed. We only consider new diagnoses that occur after a diabetes diagnosis. We do not consider diagnoses or lab values that occurred before the type 2 diabetes diagnosis. Incorporating past values might be included in future work.

Fig. 2
Above is an example of a patient network which contains demographic information, lab results, and diagnosis codes, for a patient who develops heart failure as a fast progressor. The most significant edges and nodes, as determined by the two-sided Z-test, marked in red, are used in patient risk calculation. Circles represent ICD diagnoses, hexagons demographic information, and squares clinical variables. Age and clinical variables have been quartiled such that 3.0 Age represents a patient whose age is above the 75th percentile of patients, and 1.0 eGFR represents someone whose eGFR is between 25-50 percent when compared to the patient population

Fig. 3 Above is a graph of the values shown in Table 6

• Diagnoses are truncated to the first three digits of the ICD-9 or ICD-10 code to remove the disease subtypes and only focus on the primary diagnoses.
• All nodes that are not shared by at least one percent of the population are removed.
• All patients that have received fewer than five diagnoses or more than twice the median number of diagnoses are removed. This reduces the bias introduced by individuals having an excessive medical history or too few observations.
• The cleaned dataset is sampled to ensure that our fast and slow progressors have the same number of patients.
• The significance of the edges is computed, and any edges that are not significant under a two-proportion Z-test at the 95 percent confidence level are removed.
• Fast progressors are defined as patients who develop a complication of diabetes faster than 75 percent of the population. All patients from our dataset who develop the complication before being diagnosed with diabetes, or up to one year afterwards are removed.
• Slow progressors are defined as patients who develop a complication of diabetes slower than 75 percent of the population. Everyone retained in our network is eventually diagnosed with the complication which assists in making sure the datasets are balanced and with limited bias.
• Every node and every edge is given a Z-score, which corresponds to the likelihood of a significant difference between fast and slow progressors. Every node and edge will be given the percent likelihood that a patient who has the condition given in the node, or combination of conditions as represented by an edge, will be a fast or slow progressor.
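Several of the cleaning steps listed above can be sketched in Python. The data layout (dictionaries keyed by a patient identifier), the function names, and the order in which the one-year exclusion and the quartile cut-offs are applied are assumptions; the text does not fix these details.

```python
import statistics

def truncate_code(code):
    """Keep the first three characters of an ICD-9 or ICD-10 code."""
    return code[:3]

def filter_patients(diagnoses_by_patient, min_count=5):
    """Drop patients with too few or too many diagnoses.

    The median is taken over all patients before filtering (assumption).
    """
    counts = {pid: len(codes) for pid, codes in diagnoses_by_patient.items()}
    median = statistics.median(counts.values())
    return {pid: codes for pid, codes in diagnoses_by_patient.items()
            if min_count <= counts[pid] <= 2 * median}

def label_progressors(days_to_complication):
    """Label the fastest quartile "fast" and the slowest quartile "slow".

    Patients whose complication appears before or within one year of the
    diabetes diagnosis are removed first (ordering is an assumption).
    """
    eligible = {pid: d for pid, d in days_to_complication.items() if d > 365}
    q1, _, q3 = statistics.quantiles(list(eligible.values()), n=4)
    labels = {}
    for pid, days in eligible.items():
        if days < q1:
            labels[pid] = "fast"
        elif days > q3:
            labels[pid] = "slow"
    return labels

# Hypothetical inputs.
kept = filter_patients({"a": ["428"] * 5, "b": ["250"] * 6, "c": ["401"] * 3,
                        "d": ["272"] * 30, "e": ["585"] * 7})
days = {f"p{i}": 366 + 10 * i for i in range(8)}
days["early"] = 100
labels = label_progressors(days)
```

Patients falling between the two cut-offs receive no label in this sketch, and balancing the two groups by sampling would follow as a separate step.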

Results
Our test set contained 20 percent of our patients. The percent likelihood of their complication development was computed against the patient network generated from the 80 percent training set. We queried the large network for nodes and edges corresponding to an individual patient's disease history. Because all the edges that failed to show a significant difference between the fast and slow progressors were pruned, the sub-network might be disconnected. The top five conditions that lead to each complication by percentage of fast progressors and Z-score are given in Table 7.
The results for these predictions of fast progressors for onset of these various diabetic complications are shown in Table 8. These values are averaged over five runs of different test/train splits and they are comparable to the AUCs of other real-world predictive models (Weng et al. 2017).
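For readers who want to reproduce the evaluation metric, the following is a minimal, dependency-free sketch of the AUC and of averaging it over several runs. It uses the rank-based definition of AUC (the probability that a randomly chosen fast progressor is scored above a randomly chosen slow progressor); the function names and the example values are hypothetical and are not the reported results.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for fast progressors, 0 for slow progressors.
    scores: predicted probability of being a fast progressor.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(runs):
    """Average the AUC over several (labels, scores) test splits."""
    values = [auc(labels, scores) for labels, scores in runs]
    return sum(values) / len(values)

# Hypothetical predictions for one small test split.
example = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In this example three of the four fast/slow pairs are ordered correctly, so the AUC is 0.75. The pairwise form is quadratic in the number of patients, which is acceptable for a sketch but slow for large test sets.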

Discussion
Diabetic complications are often correlated with one another, which might reflect the generalized damage that the body has taken from a micro- and macrovascular perspective (Forbes and Cooper 2013). Others have found evidence of biomarkers that have an impact on diabetic progression and can lead to a greater understanding of a patient's personalized developments with diabetes (Scirica 2017). Other researchers have created models of diabetic risk by searching endocrinology textbooks and literature from clinical trials for indicators that lead to complications (Sangi et al. 2015). We believe that our model is unique in its ability to distinguish between fast and slow progressors.

Similarities in Comorbid complications
Many of the top confidence nodes are shared between different complications. Correlations between the fast and slow progressors are given in Table 9. Some of the most significant nodes, including mental diseases such as psychoses, cerebral degenerations, psychotic conditions, and pain, are symptoms or causes of uncontrolled diabetes. This could be because many diabetic patients are suffering from many of the same comorbidities which have a negative influence on disease control and care (Magnan et al. 2015). Others have found patterns of these co-morbidities, and split diabetics into several classes which represent their progression through diabetes: severe cardiac, cardiac, noncardiac vascular, risk factors, and no concordant co-morbidities (Magnan et al. 2018).

Table 7 Here we have some of the health conditions that are most likely to lead to complications based upon percentages of patients with that condition that are fast progressors, and Z-scores which correspond to the Z-test result on these particular nodes between fast and slow progressors
Being diagnosed with a mental disorder soon after a diagnosis with diabetes would have a limiting effect on the patient's ability to maintain glycemic control. Chronic pain also limits the control of patients' diabetes, potentially resulting in development of complications (Krein et al. 2005). Diseases such as arthritis can impair patient function and drive barriers to lifestyle changes and regimen adherence (Piette and Kerr 2006). Other disabling conditions, such as heart failure or dementia, make self-care impossible (Piette and Kerr 2006). Lack of sleep worsens glucose tolerance (DJ et al. 2005), which could lead to fast complication development. Also, diabetic patients are at higher risk for sleep disorders such as nocturia, neuropathic pain, and restless leg syndrome (DJ et al. 2005). Patients with further developed complications could be more likely to have these problems, which lead to sleep disorders. For many patients, diabetic complications do not occur unexpectedly. It is a pattern of poor health that leads to many co-occurring complications of diabetes. Low eGFR is shown to be one of the top confidence nodes for fast progressors in kidney disease, in both the highest distinguishers and absolute percentages. Low eGFR is one of the most important markers of kidney disease (Levey and Coresh 2012). Renal function is a prognosticator of heart failure since it is a good marker for impaired hemodynamic status and general vascular disease (Hillege et al. 2006).

Potential implications for personalized medicine
In our future work, we would like to examine the false positives and identify what causes them to not develop complications immediately, even though their diagnosis history and lab results identify them as fast progressors. This will inform health management strategies (lifestyle, behavioral, or environmental factors) in addition to the medication

Conclusion
Given a patient's disease history and lab results, we can predict their likelihood of developing complications from diabetes. We also show what disease diagnoses or lab results (from our heterogeneous network or graph) are most likely to lead to specific diabetic complications. We reaffirm that diabetes is a complicated disease. It continues to be important for diabetic patients to manage their disease and be aware of the complications. The diagnoses graphs can help illuminate health problems faced by many patients and what might be the best course of disease management. Not managing complications, especially for fast progressors, can cause rapid development of uncontrolled diabetes, from which it is hard to recover. Moreover, disease diagnoses graphs can also be a useful tool for physicians to understand the effects of co-morbid conditions, and personalize a wellness and disease management plan. This can lead to an improvement in both individual and population health outcomes.