
Citywide quality of health information system through text mining of electronic health records


A system of hospitals in large cities can be considered a large and diverse but interconnected system. Widely applied in hospitals, electronic health records (EHR) are crucially different from each other because of the use of different health information systems, internal hospital rules, and individual behavior of physicians. The unstructured (textual) data of EHR is rarely used to assess the citywide quality of healthcare. Within the study, we analyze EHR data, particularly textual unstructured data, as a reflection of the complex multi-agent system of healthcare in the city of Saint Petersburg, Russia. Through analyzing the data collected by the Medical Information and Analytical Center, a method was proposed and evaluated for identifying a common structure, understanding the diversity, and assessing information quality in EHR data through the application of natural language processing techniques.


A system of hospitals in a big city can be considered a large and diverse but interconnected system. The interconnection of hospitals and physicians comes from operating in a common information space and the same legislative environment, and from providing health services to the population of the same city. Currently, electronic health records (EHR) (Nguyen et al. 2014) are widely adopted in healthcare organizations and improve the consistency and interoperability of health-related information. Citywide EHR integration enables the implementation of large-scale analytical and clinical projects (see, for example, NYC Macroscope; Newton-Dame et al. 2016). At the same time, large cities are characterized by multiple levels of healthcare regulation (from government to hospital-level authorities) and by diversity in clinical staff experience, individual approaches, and patterns in clinical decision making. This diversity is much more pronounced if EHRs are implemented in different health information systems (HIS) deployed in hospitals. In many cases, a significant portion of the information in an EHR is stored in unstructured (textual) form, for example, anamnesis, diagnosis, epicrisis, surgery protocols, and conclusions. At the same time, this part of the information is often an important source for systematic analysis of health service quality. In such a situation, understanding and assessing the complex health information handled in a citywide healthcare system can be quite challenging.

Within the presented research, we focused on the analysis and structuring of EHRs from hospitals in Saint Petersburg, Russia. The data was provided by the Medical Information and Analytical Center (MIAC), which is responsible for monitoring and assessing the quality of healthcare service in the city. We analyze the EHR data, particularly textual unstructured data, as a reflection of the complex multi-agent system of healthcare in the city of Saint Petersburg. Such data may contain a lot of additional information that cannot be represented in a structured form (numbers, codes, enumerated concepts). As a result, this data is usually not reviewed by either analysts or doctors. However, unstructured data can be used to analyze systemic quality in city healthcare: to identify sources of uncertainty, build information-behavioral profiles of doctors, and assess influencing factors. The MIAC acts like a distributed heterogeneous information system that contains people (doctors), individual HISs, and individual healthcare facilities. All of these are connected implicitly through the general population of patients and the legal field in which doctors work. Each “information agent” acts relatively independently, filling the system with information in the process of serving the patient flow. In this way, we can observe this variety of connections through EHRs, the quality of which we strive to evaluate.

Within the presented study, we propose and elaborate an approach to unify EHR data and improve their structure using natural language processing (NLP) techniques. The approach is considered as a way to work automatically with diverse data (i.e., coming from different hospitals and HISs) to obtain and assess the implicit structure present within the data. The approach may be used to structure the diverse data and to improve the analysis and assessment procedures of large heterogeneous healthcare systems existing in big cities through the EHR data collection. Such improvement may increase the quality of both analytical procedures (e.g., implemented by the MIAC in Saint Petersburg) and hospital-level EHR interoperability characteristics.

The structure of the paper is as follows. The next section provides an overview of the related works in EHR structuring and quality analysis. A case study of citywide healthcare quality analysis in Saint Petersburg and the dataset used are described in “Case study” section. Next, the proposed method and implementation details are provided in “Implementation details” section. The obtained results are presented and analyzed in “Results” section. Finally, “Discussion” and “Conclusion and future work” sections provide a discussion and concluding remarks of the study respectively.

Related works

Structuring and assessing a medical text to determine its completeness and search for an ideal structure are still highly relevant to most hospitals and healthcare institutions.

In their research, Weiskopf and Weng (2013) provide a review of methods to assess the quality of EHRs. They define five criteria of quality (completeness, correctness, concordance, plausibility, and currency) and seven methods that help to check EHRs according to one or more criteria: comparison with gold standards, data element agreement, data source agreement, distribution comparison, validity checks, log review, and element presence. Similar criteria (accuracy, correctness, validity, completeness, timeliness, usefulness, etc.) are identified by St-Maurice and Burns (2017). One of the most popular methods is gold standard compliance. Based on this paper, other researchers expand the list of criteria. For example, Batini and Scannapieco (2016) collect criteria from many papers, including new ones: usefulness, cost-effectiveness, and confidentiality. However, some of the criteria are difficult to interpret when assessing textual data. In our research, we aim to construct a gold standard based on a large amount of data and the doctors’ experience invested in them, and then to compare new records with this gold standard using the methods of data element agreement and element presence.

Another approach is to check for the presence of certain records in EHRs to assess their completeness and relevance. For example, van der Bij et al. (2017) estimate such parameters as the percentage of episodes that have a “meaningful” ICPC code, the percentage of drugs linked to an episode of care, and others. Burke et al. (2014) identify 12 structures in an EHR to assess its quality. In our case, we plan to check for the presence of certain information within one record. There are other methods for assessing the structure and completeness of records. Logan et al. (2001) try to find an acceptable way of EHR recording and assess completeness and correctness by comparing a video of a meeting between a patient and a doctor with the EHR data and their structure. Often, studies aiming to find the most complete and accurate format are based on surveys of a small number of doctors (Williams 2003). It is possible to develop a model to extract specific information (Wang et al. 2012; Yehia et al. 2019). However, this requires manually defining a list of questions and entities for each record type, so it is less applicable to the variety of real-world records that have accumulated in many HISs and hospitals.

Often, data semantics are presented in the context of data interoperability to transfer data between different MISs and clinical applications. Nguyen et al. conducted a systematic review of EHR implementation with an assessment of information systems with DeLone and McLean’s framework (Nguyen et al. 2014). Sun et al. present the architecture of their semantic processing approach where data is transmitted through the semantic layer with clinical ontologies inside (Sun et al. 2015). Most solutions for data interoperability are based on ontologies (Sun et al. 2015; Roberts and Demner-Fushman 2016; Freedman et al. 2020; Kersloot et al. 2020). Moreover, Kersloot et al. (2020) show the statistics for mapping clinical text fragments to ontology concepts that are described in reviewed papers. They conclude that 88% of the studies do not present any validation. In addition, it is now common to achieve semantic interoperability between different MISs and clinical databases using existing libraries for database matching (Bruland et al. 2017). Also, semantic assessment and overview are often presented for academic texts (scientific papers, articles, and books), which require specific preprocessing and methods for formal language that differ considerably from those for clinical records (Datta et al. 2019).

Another approach is not to try to structure the records, but to extract specific knowledge on demand. For example, Lamy et al. (2019) show an example of a pipeline for finding and extracting the necessary information for the Portuguese language: Portuguese EHRs are translated into English, and then ready-made and already well-proven tools for English are used. Recent studies often include machine learning approaches to retrieve information from EHRs and assess the completeness of the records. This way, works by Tang et al. (2013), Funkner and Kovalchuk (2020) investigate NLP within the task of reconstruction of temporal structures and events from EHRs.

Most of the authors emphasize the importance of structuring and analyzing EHRs to improve the quality, interoperability, and integrability of both information and health services. One of the most important problems is the improvement of the structure and, as a consequence, the interoperability of EHR data. Within our study, we focus on understanding the structural diversity and possible interpretation of EHRs in the healthcare system of large cities through the analysis of unstructured (free-form text) parts of EHRs collected from hospitals. The presented approach is aimed at automatic (unsupervised or semi-supervised) structuring procedures that can work with weakly structured EHR data without predefined domain-specific unified structures, dictionaries, and semantics. An important advantage of such an approach is that it can be transferred to low-resource languages where domain-specific NLP tools are not well represented.

Case study

For this study, we consider a set of 79,234 depersonalized records of patients with arterial hypertension (AH) and acute coronary syndrome (ACS) who applied to medical centers in St. Petersburg, Russia, in 2020. The data was collected by the MIAC for the analysis of EHR and health service quality. The records were provided by 107 institutions using 13 different HISs (Fig. 1). The selection of HISs was developed, provided, and supported by different vendors. Each HIS has its architecture and user interface. Thus, the common practice of EHR input varies significantly.

Fig. 1

Distribution of EHR providers (healthcare organizations) in Saint Petersburg with different HISs

The structured EHR data collected by the MIAC is widely used for monitoring and analytical purposes as well as for centralized development and regulation of informatization of the city healthcare system. However, even though the textual data in EHRs contains important information on the provision of health services, use of such data is significantly limited due to the lack of structuring and diversity in format. The practical goals of this study are analysis, structuring, and quality assessment of the EHRs. The results may be used by the MIAC to improve their analytical facilities as well as by hospitals to support better information processing, interoperability, and clinical decision support in their HISs.

Commonly, unstructured EHR data in Russian practice is a natural language text that contains many specific medical terms, abbreviations, words in Latin and, less often, English (names of equipment, drugs). Unfortunately, raw text often contains typos and other distortions caused by the data transfer between information systems (connected words, lack of separators between sentences, HTML and XML tags). Such texts can have some structural features: they contain subheadings or field names separated by colons (see examples of possible textual EHR data structure in Fig. 2). Most often, an unstructured text presents a patient's life and illness history, discharges, records of consultations, and less often protocols of operations and other medical procedures. For example, records can have a title ‘Anamnesis’ followed by its text (Fig. 2a), or a title ‘Protocols of operation’ and subtitles ‘Type of the operation’, ‘Duration of the operation’, etc. (Fig. 2b), or have no title but contain embedded subtitles such as ‘Diagnosis’, ‘Vital signs’, etc. (Fig. 2c). Records can also be totally unstructured, without any indicated titles and subtitles (Fig. 2d). Moreover, each of these records has its own format and language features that depend on the medical center and the HIS. The above problems have a critical impact on the speed and feasibility of automatic processing of such texts. It is also worth noting that the data is presented in Russian, which narrows down the range of available tools for language processing.

Fig. 2

Different forms of unstructured textual EHR data

Implementation details

General processing phase

Currently, there is a lack of ready-to-go technologies available for domain-specific medical text analysis in a language other than English (Névéol et al. 2018). For the last year, our research team has been developing a set of tools for automatic processing of medical texts in Russian. These tools are implemented as extensible Python modules aimed at processing various types of medical texts. Currently, there are five tools at different stages of development (Fig. 3): spelling correction, negation detection for diseases, extracting the experiencer of the disease, topic segmentation, and extraction of temporal structures and events (Balabaeva and Kovalchuk 2020; Balabaeva et al. 2020; Funkner and Kovalchuk 2020; Shaikina and Funkner 2020; Funkner et al. 2020). Each tool solves a specific problem or helps with text preprocessing, but none of them determine the general structure of the text.

Fig. 3

NLP modules for processing medical texts in Russian

The current study uses and extends the implemented software for NLP, structuring textual data, and identifying basic elements of EHRs.


This section describes methods for processing records, structuring them, and assessing their quality. Figure 4 shows the three main stages of record processing: identifying the type of record, substructure recognition, and assessing the quality. At each stage, models are trained (blue elements in Fig. 4), which can be used for new records. Besides, Fig. 4 shows the topics and record formats (yellow elements), which are defined in the training dataset and can be easily interpreted by a specialist.

Fig. 4

Diagram of methods to detect and estimate record structure

EHR type detection

Within our study, we consider processing a dataset containing EHRs collected from different medical centers with different HISs. The first step in structuring such heterogeneous records is to identify the type of record (consultation with a doctor, test results, surgery protocol, etc.). Since records are collected in a large number of healthcare organizations with their own rules and practices, the same type of record may have different names.

We propose to use a classic approach for grouping text titles: preprocessing (removing extra characters, lemmatization), removing stop words (mainly prepositions, pronouns, and conjunctions), TF-IDF (term frequency, inverse document frequency) transformation to reduce the weight of background words (for example, the words “doctor” and “hospital” are background words in this context, although in general they are not stop words), and clustering of the TF-IDF vectors. Within this study, we use only record titles to identify a record type. It is possible to add extra features of records, but the complexity of model training will increase. Many clustering methods are appropriate: hierarchical clustering can show the nesting of clusters within one another at different thresholds, while the k-means method is easily interpreted in terms of vectors and makes it possible to find the “central” record (closest to the cluster center), with which other records of the cluster are compared. We propose to use the OPTICS method, which finds clusters in the feature space based on density (Ankerst et al. 1999). After some experimentation, we noticed that the names of some types of records do not differ much between medical centers (for example, discharge reports). At the same time, the names of other types of records, for example, specialist consultations, may differ in length (institutions add the name of the medical department, type of specialist, doctor’s name, etc.) and in wording (synonyms: examination instead of consultation, etc.). Thus, groups of names of different density and size are formed in the feature space, and the OPTICS method is the most appropriate for identifying them.
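The vectorization step of this pipeline can be illustrated with a minimal sketch in plain Python. The titles are English stand-ins for Russian record titles, the stop word list is hypothetical, and lemmatization is omitted; the smoothed IDF formula is one common variant and is an assumption, not necessarily the one used in the study. The resulting vectors would then be passed to a density-based clustering implementation such as `sklearn.cluster.OPTICS`.

```python
import math
import re
from collections import Counter

# Hypothetical stop words (English stand-ins for the Russian ones).
STOP_WORDS = {"of", "the", "a", "by", "with"}


def tokenize(title):
    """Lowercase the title, keep only letter sequences, and drop stop words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(titles):
    """Return (vocabulary, vectors) with TF-IDF weights for each title."""
    docs = [tokenize(t) for t in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    # Smoothed IDF: terms that occur in many titles get a lower weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


titles = [
    "Consultation of the cardiologist",
    "Consultation of the neurologist",
    "Discharge report",
    "Consultation with a therapist",
]
vocab, vectors = tfidf_vectors(titles)
```

In this toy example the background word "consultation" occurs in three of the four titles, so it receives a lower weight than the more specific word "cardiologist" within the first title.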

To determine the type of new records in the future, we propose to train a classifier that uses the cluster number as the class. One of the simplest and most suitable methods is k-nearest neighbors classification, since it is based on the distribution of vectors in the feature space.
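As an illustration, a k-nearest neighbors classifier over title vectors might look like the sketch below. The vectors and cluster labels are toy values, and the use of cosine similarity as the proximity measure is an assumption; the text does not specify the metric.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority cluster label among the k most similar vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy three-dimensional title vectors; labels are hypothetical cluster numbers.
train_vectors = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.8, 0.0, 0.1],
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.0, 0.2, 0.8],
]
train_labels = [0, 0, 0, 1, 1, 1]

predicted = knn_predict(train_vectors, train_labels, [0.7, 0.1, 0.1], k=3)
```

Using an odd `k` with two classes avoids ties in the vote; with more clusters a tie-breaking rule would be needed.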

At this stage, it is possible to use any other methods of vector transformation, clustering, and classification that are most suitable for the peculiarities of the record language and the specificity of the recordset.

Subsection recognition

Finding substructures in records is highly dependent on the dataset. In our dataset, we noticed that records of the most represented HIS contain subsections whose names (subheadings) can be extracted using regular expressions. Thus, each record is divided into subsections, each with a subheading. Often, the beginning of a record does not have a subheading, so it is given the service name #record_start. Figure 5 shows an example of splitting a record using regular expressions.
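A minimal sketch of such splitting is given below. The regular expression is hypothetical (a capitalized phrase of up to four words followed by a colon) and the sample record is an invented English stand-in; the actual patterns depend on the HIS and on the language of the records.

```python
import re

# Hypothetical pattern: a capitalized phrase of up to four words and a colon.
SUBHEADING = re.compile(r"([A-Z][A-Za-z]*(?: [A-Za-z]+){0,3}):")


def split_into_subsections(text):
    """Split a record into (subheading, body) pairs in order of appearance."""
    sections = []
    last_name = "#record_start"  # service name for text before any subheading
    last_end = 0
    for match in SUBHEADING.finditer(text):
        body = text[last_end:match.start()].strip()
        # Skip the service section if the record starts with a subheading.
        if body or last_name != "#record_start":
            sections.append((last_name, body))
        last_name = match.group(1)
        last_end = match.end()
    sections.append((last_name, text[last_end:].strip()))
    return sections


record = (
    "Patient seen at home. Complaints: headache, dizziness. "
    "Vital signs: BP 150/90 mm Hg, HR 78. Diagnosis: arterial hypertension."
)
sections = split_into_subsections(record)
```

For the sample record this yields four subsections named `#record_start`, `Complaints`, `Vital signs`, and `Diagnosis`.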

Fig. 5

Example of identifying subsections in a record

If records do not contain subheadings or other substructures, one can skip this step. If the texts of the record are quite long and consist of several paragraphs, then topic modeling can be carried out on these paragraphs. This makes it possible to divide the records into substructures and reveal their format.

Subsections are grouped by subheadings. For each type of subsection, separate topic modeling is carried out using the method of additive regularization. Additive regularization of topic models adds regularizers to the matrix decomposition, which help to highlight background topics, sparsify the topic matrix for a clearer separation between topics, and automatically determine the number of topics (Vorontsov et al. 2015). In addition, this method shows the best modeling results on texts of different lengths, which is typical for EHRs from different medical centers. We carry out 50 iterations without any regularizers and an extra 30 iterations with a sparsing regularizer (smooth sparse phi regularizer with tau = 1e6). However, the tau value for regularizers depends on the number of input texts and their lengths, so for a much larger or smaller corpus the value should be tuned manually. Topic modeling on each section allows us to identify key terms (excluding background words) for each type of subsection, compare them with each other, and validate how well the declared subheading corresponds to the content of the subsections.

Based on the identified topics and their key terms, the topic segmentation model that we developed earlier is trained (Shaikina and Funkner 2020). This model calculates the frequency of topic terms in each sentence and adds the coefficient of the most frequent terms of the previous and next sentences. The trained topic segmentation model can be used to label other records that do not have substructures inside.
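The scoring idea can be sketched as follows. This is a simplified reading of the description above, not the implementation from Shaikina and Funkner (2020): the topics, key terms, sample sentences, and the `neighbor_weight` coefficient are all hypothetical.

```python
import re

# Hypothetical topics and key terms; in the study they come from topic modeling.
TOPIC_TERMS = {
    "complaints": {"headache", "dizziness", "pain", "complains"},
    "vital_signs": {"bp", "pressure", "pulse", "hr"},
    "treatment": {"mg", "pill", "morning", "evening"},
}


def topic_counts(sentence):
    """Count occurrences of each topic's key terms in one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {
        topic: sum(tok in terms for tok in tokens)
        for topic, terms in TOPIC_TERMS.items()
    }


def segment(sentences, neighbor_weight=0.5):
    """Label each sentence with the topic that has the highest score.

    The score is the sentence's own term count plus a weighted share of
    the term counts of the previous and next sentences.
    """
    counts = [topic_counts(s) for s in sentences]
    labels = []
    for i, own in enumerate(counts):
        scores = dict(own)
        for j in (i - 1, i + 1):
            if 0 <= j < len(counts):
                for topic, value in counts[j].items():
                    scores[topic] += neighbor_weight * value
        labels.append(max(scores, key=scores.get))
    return labels


sentences = [
    "Patient complains of headache and dizziness.",
    "Worse in the evening.",
    "BP 150/90, pulse 78.",
    "Pressure measured twice.",
    "Enalapril 10 mg, one pill in the morning.",
]
labels = segment(sentences)
```

The second sentence contains only one treatment term ("evening"), but the contribution of the neighboring sentence assigns it to the complaints topic, which is the effect the neighbor coefficient is meant to achieve.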

Assessing quality and record format detection

The “Subsection recognition” section describes how to extract subsections from records and thereby identify record structures from data. The next step is to identify the typical record structure for the extracted subsections. In this work, we calculate the frequency of subsections by type of record and, according to the selected threshold, determine the most appropriate subsections for each record type. The choice of the threshold can be manual or automatic with searching for a critical value: the threshold is calculated by determining the proportion of records that are considered “ideal”.
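A sketch of the threshold-based selection is shown below. Here the frequency is taken as the share of records of a given type that contain the subheading; this normalization, the sample records, and the threshold value are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def typical_formats(records, threshold=0.5):
    """Map each record type to subheadings present in at least `threshold`
    of the records of that type."""
    totals = Counter()
    presence = defaultdict(Counter)
    for record_type, subheadings in records:
        totals[record_type] += 1
        # Count each subheading at most once per record.
        for name in set(subheadings):
            presence[record_type][name] += 1
    return {
        record_type: sorted(
            name
            for name, count in presence[record_type].items()
            if count / totals[record_type] >= threshold
        )
        for record_type in totals
    }


# Hypothetical records: (record type, subheadings extracted from the text).
records = [
    ("examination", ["Complaints", "Diagnosis", "Vital signs"]),
    ("examination", ["Complaints", "Diagnosis"]),
    ("examination", ["Complaints", "Recommendations"]),
    ("examination", ["Diagnosis", "Complaints"]),
    ("epicrisis", ["Diagnosis", "Treatment"]),
    ("epicrisis", ["Diagnosis"]),
]
formats = typical_formats(records, threshold=0.5)
```

Lowering the threshold admits rarer subheadings into the typical format, so the number of records regarded as complete depends directly on this choice.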

“Ideal” record formats are stored as a dictionary, where the key is the record type and the value is a list of subheadings (Python programming language):

figure a

After determining the “ideal” format, we can calculate how many of the required subsections each record contains (after extracting subsections with regular expressions or topic segmentation, see “Subsection recognition” section). In addition, based on the “ideal” format, we can make recommendations about which sections to add, and which ones are better to transfer to other types of records.
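The assessment of a single record against such a dictionary could be sketched as follows. The record types and subheadings in `IDEAL_FORMATS` are hypothetical examples, not the formats obtained in the study.

```python
# Hypothetical "ideal" formats: record type -> list of required subheadings.
IDEAL_FORMATS = {
    "examination": ["Complaints", "Anamnesis", "Vital signs", "Diagnosis"],
    "epicrisis": ["Diagnosis", "Treatment", "Recommendations"],
}


def assess_record(record_type, subheadings):
    """Return (completeness, missing, extra) for one record.

    completeness: share of required subsections found in the record
    missing: required subsections that are absent (candidates to add)
    extra: subsections outside the "ideal" format (candidates to move)
    """
    required = IDEAL_FORMATS[record_type]
    present = set(subheadings)
    missing = [name for name in required if name not in present]
    extra = [name for name in subheadings if name not in required]
    completeness = (len(required) - len(missing)) / len(required)
    return completeness, missing, extra


completeness, missing, extra = assess_record(
    "examination", ["Complaints", "Diagnosis", "Treatment"]
)
```

In this example the record covers half of the required subsections; `missing` lists what could be recommended for addition, and `extra` lists subsections that may belong to another record type.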


Preliminary data analysis

Each provided record includes metainformation (patient ID, specialist ID, institution ID and name, HIS ID, date, ICD-10 diagnosis, record name) and free-form text. Text can be presented in different forms and includes from 0 to 827 sentences (Fig. 6). Moreover, Fig. 6 shows how different HIS records are. For example, the most represented HISs (#1 and #5) have the same dispersion for text length and number of words and sentences. HIS #11 includes shorter texts, but has many more words and sentences on average. Probably, HIS #11 has more abbreviations and omitted words inside its records. HIS #3 provides long texts but the number of words is about zero. It means that the records are filled with special signs and HIS tags. HISs #10, #12, and #4 for all metrics have a median of about zero, so we suppose that most of the records are empty or filled with meaningless special signs.

Fig. 6

Comparison of HIS corpora for text length (a), number of words in texts (b), and number of sentences in texts (c). HISs are sorted by the number of records

We also compared unique words and their incidence rate in each HIS. HIS #1 and #5 contain the most unique words: 57,348 and 33,327 words, respectively. Also, they have the largest intersection rate of unique words: 15,231 words (see Fig. 7). HIS #11 contains only 48 records (records are long and have many words and sentences, see Fig. 6), however, it has one of the largest intersection rates with HIS #1 and #5. This indicates the similarity of the content of records in HIS #11 to the most represented HISs.
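The comparison of vocabularies reduces to set operations, as in the sketch below. The corpus names and texts are invented English stand-ins, and the tokenization is deliberately simplistic.

```python
import re
from itertools import combinations


def vocabulary(texts):
    """Return the set of unique lowercase words in a list of texts."""
    return {
        tok for text in texts for tok in re.findall(r"[a-z]+", text.lower())
    }


# Hypothetical corpora keyed by HIS name.
corpora = {
    "HIS_A": ["BP 150/90 mm Hg, complaint of headache", "diagnosis hypertension"],
    "HIS_B": ["complaint of chest pain", "diagnosis acute coronary syndrome"],
    "HIS_C": ["BP stable, no complaint"],
}
vocabularies = {name: vocabulary(texts) for name, texts in corpora.items()}

# Number of shared unique words for every pair of HISs.
shared = {
    (a, b): len(vocabularies[a] & vocabularies[b])
    for a, b in combinations(sorted(vocabularies), 2)
}
```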

Fig. 7

Upset plot for the most common intersections of unique words of HISs

Table 1 shows the most frequent words for HIS #1, #5, and #9, excluding stop words. The most common words for these HISs are units of measurement: doses of prescribed drugs and results of medical tests (mg), blood pressure measurements (mm, Hg), heart rate (min), frequency of drug intake (pill, day, morning, evening). Also, many words are associated with describing a patient's condition: history, state, breathing, heart, satisfactory, complaint, diagnosis, etc.

Table 1 Most common words in the most presented HISs

We also try to estimate the number of misprints and spelling errors in the texts (see Table 2). Using the module for correcting misprints (Balabaeva et al. 2020), we compare the unique words from the records with the corresponding dictionaries (dictionaries of medical terms, Russian spelling dictionary, English dictionary, dictionary of medicines). Also, based on these dictionaries, the proportion of correct (found in dictionaries) words is estimated. Most misprints in unique words are contained in HISs #1 and #5, which is expected: the more words, the more mistakes. However, HIS #3 has a low share of correct words, although it contains only 170 entries. When processing the words of this system, it is necessary to correct misprints.
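The share of correct words can be estimated with a simple dictionary lookup, sketched below. The dictionaries and words are tiny hypothetical English stand-ins; the misprint correction module itself (Balabaeva et al. 2020) is not reproduced here.

```python
def share_of_correct_words(unique_words, dictionaries):
    """Return the share of unique words found in at least one dictionary."""
    known = set().union(*dictionaries)
    words = {w.lower() for w in unique_words}
    if not words:
        return 0.0
    return len(words & known) / len(words)


# Hypothetical dictionaries (medical terms, general spelling, medicines).
medical_terms = {"hypertension", "anamnesis"}
general_words = {"patient", "pain"}
medicines = {"enalapril"}

words = ["Patient", "hypertension", "hypertenzion", "enalapril", "pian"]
share = share_of_correct_words(words, [medical_terms, general_words, medicines])
```

Words not found in any dictionary ("hypertenzion", "pian") are candidates for misprint correction, although in real data some of them would be valid abbreviations or rare terms.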

Table 2 Metacharacteristics: misprints, terms, abbreviations according to dictionaries

Structuring and analysis of EHR data

We have applied the methods described in “Implementation details” section to the MIAC dataset. As shown in Fig. 8, HISs #1 and #5 have the most records (88% of records in total) in the analyzed dataset. Therefore, we used the data of these HISs for further training and validation of all the models.

Fig. 8

Distributions of record types for patients with arterial hypertension (AH) and acute coronary syndrome (ACS) with manual labeling

Figure 8 shows the manually labeled types of records for all HISs. As can be seen, HIS #1 contains only three types of records: examination (83% of all records), epicrisis and doctor's consultation. HIS #5 contains 10 types of records. As HIS #2 has specific names for its records, all of them are labeled as “other”.

Furthermore, automatic labeling of records by types is carried out: the OPTICS method for clustering is applied to the preprocessed and vectorized headers of records (see “EHR type detection” section). After reviewing all the clusters, the names for each group were determined. Compared to manual labeling (Fig. 8), clusters have more specific names (Fig. 9). Also, we compare how manual and OPTICS clusters are related by records (Fig. 10). The main share of records from the manual category “examination” transfers into the category of “examination sheet”, and so on with “epicrisis” and “statement epicrisis”. However, the manual categories “consultation”, “appointment” and “other” are divided into consultation and examination groups by different specialists. Besides, Table 3 shows clustering metrics for both labeling systems: silhouette coefficient (the closer to 1, the better), Calinski-Harabasz index (the higher the better), and Davies-Bouldin index (the lower, the better). OPTICS clustering shows better results according to all calculated metrics. Figure 11 shows how OPTICS records are distributed across HISs. HIS #2 now has several types of records, not just “others” as in Fig. 8. HIS #1 still consists of three types of records.

Fig. 9

t-SNE dimensionality reduction for record header space (the header points are colored according to OPTICS cluster)

Fig. 10

Distribution of records by manual clusters and OPTICS clusters

Table 3 Comparison of labeling using clustering metrics

Fig. 11

Distributions of record types for patients with arterial hypertension (AH) and acute coronary syndrome (ACS) with OPTICS labeling

At the next stage, we extract the subheadings and subsections (see “Subsection recognition” section) for HIS #1 and HIS #5: 225 and 150 subsections, respectively. Subheadings are grouped manually to simplify visualization and primary analysis. The distribution of subsections is shown in Fig. 12. For some record types (questionnaire, form, and impression), no subheadings were found using regular expressions. In general, subsections are well related to what should be in the record type: “report” contains the largest proportion of subsections related to health indicators (blood pressure, heart rate, etc.); “epicrisis” contains the largest proportion of “symptoms”, as it describes what happened to the patient during the hospitalization.

Fig. 12

Distribution of subsections among record types for two selected HISs

In addition, topic modeling and segmentation were carried out on the texts of the subsections (see “Subsection recognition” section). The training, validation, and test sets of each HIS are 50%, 20%, and 30% of the entire set, respectively. For HIS #1, automatic segmentation to predict the type of subsection achieves an F1-score of 0.51 (38 classes), and for HIS #5 an F1-score of 0.11 (19 classes). The low metric is due to the large variability of subheadings (careful grouping and processing are necessary to find identical names and combine subsections more correctly). Clustering and classification methods similar to those in “EHR type detection” section can be applied.

The “Assessing quality and record format detection” section describes methods for assessing the quality of records. Figure 13 shows the percentage distribution of subsections by record type. As can be seen, the types of records are characterized by different types of subsections and their number. To assess the quality, a threshold of 10% was chosen: if at least 10% of the texts of the considered subsection are contained in this type of record, then this subsection is typical for this type. Based on this threshold, the records of HISs #1 and #5 are assessed for each medical center separately. Table 4 shows that less than 1% of the records are found to be reasonably accurate (containing more than 80% of the required subsections). In total, there are more than 30 thousand records for each of these two HISs.

Fig. 13

Percentage distributions of subsections among record types for HIS #1 and HIS #5 (different colors for different subsections)

Table 4 Automatically retrieved accurate records for HIS #1 and HIS #5 according to subsection representations

Interpretation and evaluation

This section provides an expert evaluation of the quality of records based on the extracted data about the record type and subsections within the record. The evaluation was performed together with specialists of MIAC who are involved in monitoring and assessing the information quality of hospitals in Saint Petersburg. The goal of the evaluation was twofold. First, the analysis of the EHR structure and completeness was performed for selected hospitals. Second, the comparison to existing assessing procedures applied in MIAC was performed to consider possible extension and updating of them. To reach the goal, the current study was focused on indices based on EHR structure which relatively reflect the completeness of EHR. At the same time it enables close comparison to existing official measures of EHR implemented in MIAC.

Based on the processed information, an index of the EHR structure in a healthcare facility was constructed with an assessment of the presence of the necessary subheadings for patients with arterial hypertension and acute coronary syndrome (AH and ACS) for 23 most represented institutions using HIS#1 and HIS#5.

Let \(r\) be a record from the dataset, where each record is a set of subtitles \(s\): \(r = \left\{ s_{1}, s_{2}, \ldots \right\}\). To calculate the EHR structure index, the various types of EHR were grouped into three groups: (1) epicrisis (\(G_{1} = \left\{ r_{1}, r_{2}, \ldots, r_{n_{1}} \right\}\)); (2) examination, appointment, consultation (\(G_{2} = \{ r_{1}, r_{2}, \ldots, r_{n_{2}} \}\)); (3) other records, including reports (\(G_{3} = \{ r_{1}, r_{2}, \ldots, r_{n_{3}} \}\)), where \(n_{i} = \left| G_{i} \right|\) is the number of records in group \(G_{i}\). For each group, the set of possible subtitles \(T_{G_{i}}\) is calculated with the formula:

$$T_{G_{i}} = \left\{ s_{1}, s_{2}, \ldots, s_{k_{G_{i}}} \;\middle|\; s_{m} \in \bigcup\limits_{j = 1}^{n_{i}} r_{j},\ m = \overline{1, k_{G_{i}}} \right\},\quad i \in \left\{ 1, 2, 3 \right\}.$$

For each group \(G_{i}\), the total frequency of occurrence of the subheadings \(s\) is calculated relative to the total number of records \(n_{i} = \left| {G_{i} } \right|\) in the group. The sum of occurrence rates, normalized by the total number of subheading types, ranges from 0, when no record in the group contains any of the subheadings, to 1, when every record contains all of them:

$$\begin{aligned} SI\left( {G_{i} } \right) & = \frac{{\mathop \sum \nolimits_{{s_{j} \in T_{{G_{i} }} }} \mathop \sum \nolimits_{{r_{l} \in G_{i} }} \Delta \left( {s_{j} ,r_{l} } \right) }}{{n_{i} k_{{G_{i} }} }},\quad n_{i} = \left| {G_{i} } \right|,\quad k_{{G_{i} }} = \left| {T_{{G_{i} }} } \right|, \\ \Delta \left( {s,r} \right) & = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {if\quad s \in r} \hfill \\ {0,} \hfill & {if\quad s \notin r} \hfill \\ \end{array} } \right.. \\ \end{aligned}$$

This value is proposed to be taken as an index of the EHR structure.
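The definitions above can be illustrated with a minimal Python sketch. It assumes that each record has already been reduced to a set of subtitle strings; the function names and the toy subtitles are hypothetical and are not taken from the study's code or data. Returning 0 for an empty group is also an assumption, since the formula is undefined in that case.

```python
from typing import Iterable, List, Set


def subtitle_vocabulary(group: Iterable[Set[str]]) -> Set[str]:
    """T_G: the union of subtitles over all records in the group."""
    vocabulary: Set[str] = set()
    for record in group:
        vocabulary |= record
    return vocabulary


def structure_index(group: List[Set[str]]) -> float:
    """SI(G): share of (subtitle, record) pairs in which the subtitle is present."""
    vocabulary = subtitle_vocabulary(group)
    n, k = len(group), len(vocabulary)
    if n == 0 or k == 0:
        return 0.0  # assumption: an empty group or vocabulary scores 0
    # Sum of Delta(s, r) over all subtitles in T_G and all records in G.
    present = sum(1 for s in vocabulary for r in group if s in r)
    return present / (n * k)


# Hypothetical toy group of three records (subtitle names are made up).
epicrisis_group = [
    {"complaints", "anamnesis", "diagnosis"},
    {"complaints", "diagnosis"},
    {"diagnosis"},
]
print(structure_index(epicrisis_group))
```

In this toy group, three distinct subtitles occur across three records and six of the nine possible subtitle and record pairs are filled, so the index is 6/9, roughly 0.67.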

The results of calculating the EHR structure index are presented in Table 5.

Table 5 Comparison of EHR structure indices by HIS and records with AH and ACS diagnoses

According to Table 5, EHRs are better managed in HIS#1. This may indicate both more developed functionality and a deeper level of implementation of this type of HIS. However, HIS#1 is implemented only in outpatient clinics, whereas HIS#5 is implemented in both outpatient and inpatient clinics. To compare the HISs under the same conditions and to exclude the impact of different EHR requirements for outpatients and inpatients, the index for each HIS was calculated separately for outpatient and inpatient clinics (see Table 6).

Table 6 Comparison of EHR structure indices by outpatient and inpatient clinics

Based on the EHR structure index of each medical organization, a rating of healthcare facilities was formed, which ranks facilities from more structured EHR management to less structured and less detailed management. This rating makes it possible to compare clinics objectively and to apply administrative or incentive measures to equalize the quality of EHR management. In addition, analyzing how the EHR structure index changes over time in Saint Petersburg as a whole can support conclusions about the effectiveness of organizational and financial incentives and help forecast when target levels of EHR structuring will be reached.

In Saint Petersburg, an EHR completeness and quality index introduced by the MIAC for assessing the city’s healthcare facilities (hereinafter the MIAC index) is already in use. The MIAC indices are calculated according to an officially approved methodology for assessing EHR completeness in different hospitals. The index is based on the presence of explicit records and documents provided by the hospital within the integrated HIS. The calculated indices are published periodically on the official MIAC site, both at the city level and for individual hospitals (MIAC 2021). The correlation between the MIAC index and the EHR structure index across healthcare facilities was calculated. It turned out to be low (0.231), which in practice means a weak relationship between these indices. This can be explained by the fact that the MIAC index characterizes the completeness of the transmission of records, while the EHR structure index characterizes the completeness of the EHRs themselves (see Table 7).

Table 7 Overall rating of healthcare facilities by the EHR structure index in comparison with the MIAC index
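The rating and the comparison of the two indices can be sketched as follows. This is an illustration under stated assumptions: the source does not say which correlation coefficient was used, so the Pearson coefficient is assumed here, and the facility names and index values are hypothetical placeholders rather than the study's data.

```python
from math import sqrt
from typing import Dict, List, Tuple


def rank_facilities(index_by_facility: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order facilities from the highest structure index to the lowest."""
    return sorted(index_by_facility.items(), key=lambda item: item[1], reverse=True)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only; not the study's data.
structure_index = {"clinic_A": 0.62, "clinic_B": 0.41, "clinic_C": 0.55, "clinic_D": 0.30}
miac_index = {"clinic_A": 0.70, "clinic_B": 0.90, "clinic_C": 0.60, "clinic_D": 0.80}

rating = rank_facilities(structure_index)
facilities = [name for name, _ in rating]
r = pearson(
    [structure_index[f] for f in facilities],
    [miac_index[f] for f in facilities],
)
print(rating)
print(round(r, 3))
```

A rank-based coefficient such as Spearman's could be substituted if only the ordering of facilities matters; the choice here is illustrative.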


Textual data in an EHR contains an important portion of the information about the healthcare service provided to the patient. Structuring such information plays an important role in multiple tasks, including improving information consistency and interoperability, clinical decision support, and healthcare facility assessment. In our case study, a relatively low correlation was observed between the informativeness of the structured and unstructured parts of EHRs, as measured by the presented indices. After a detailed analysis of the existing structures and the variation in EHRs discovered during this study, a discussion was initiated with the MIAC experts. One of the conclusions reached was that the existing indices used in practice for assessing the completeness and quality of EHRs need to be extended with a deeper analysis of EHRs using the developed procedures. Thus, structuring and analysis play an important role in improving EHRs, both when they are collected in the MIAC and inside the HISs of the hospitals. Also, a possible update of the existing indices can improve the assessment of the quality of information management in hospitals and the ranking of organizations across the city by making the indices more detailed and well-grounded.

Another important result of the diversity and structuring analysis is the set of behavioral patterns of physicians who input EHR data. These patterns can be seen through the structure and meta-characteristics of the text. They reflect the principles and practices in clinical decision making, as well as the experience of a physician. Moreover, the additional closeness of EHRs within a single hospital can be explained by the “common information space” within the hospital, where general rules and practices are implemented in a unified way. Further analysis and interpretation of this diversity can serve as a source for identifying the profiles of hospitals and physicians within a complex citywide healthcare system.

The proposed method may be considered a general way for analysis and structuring of EHR data in diverse datasets. The approach enables a deeper understanding of the sources of diversity and differentiates particular structures in EHR data in an automatic or semi-automatic way. Considering EHR data as a reflection of the real-world healthcare system and the processes in it, a possible application is assessing and improving healthcare service through identification and sharing of the best practices both in terms of clinical decision making and information structuring. We believe that the proposed approach may be used to structure EHR data for better understanding and analysis of distributed healthcare systems.

Conclusion and future work

Within the proposed work, we introduce an approach for structuring and analysis of EHR data in a distributed complex healthcare system. The proposed method can be applied in diverse applications including assessment, improvement of information in EHR systems, and extending the healthcare service with additional clinical decision support and analytical services. Within this research, we consider a case study of the city healthcare system in Saint Petersburg, Russia, to introduce additional structuring, analysis, and assessment of healthcare facilities. The obtained results were used by the MIAC for further improvement of the citywide healthcare monitoring and assessment system. The listed problems of processing unstructured records and the absence of a unified HIS in Saint Petersburg are the basis for a new large-scale project for the analysis, unification, and standardization of the accumulated data in the MIAC to analyze the citywide quality of healthcare. We believe that the proposed approach may be applied in different cases where diverse EHR data is processed and analyzed (e.g., data collected on the level of large cities or even a country).

Further development of the approach includes several directions. First, the approach can be extended with a deeper interpretation of the diversity in EHRs (including personal experience, local policies, common information in hospitals, etc.). Second, the multiscale information sharing between physicians, hospitals, and HISs can be estimated and analysed. Third, physician profiling and personal practices can be identified, structured, and assessed in order to correct bad practices or share good ones. Finally, information exchange in a global diverse environment can be optimized to improve both clinical practices and information interoperability.

Availability of data and materials

The data that supports the findings of this study is available from the MIAC, but restrictions apply to the availability of this data, which was used under license for the current study, and so it is not publicly available.





Acute coronary syndrome

Arterial hypertension

Blood pressure

Electronic health records

Hypertension disease

Hydrargyrum (Mercury)

Health information system

HyperText markup language

International classification of diseases

Medical information and analytical center

Natural language processing

Ordering points to identify the clustering structure

T-distributed stochastic neighbor embedding

Term frequency–inverse document frequency

Extensible markup language






This work was supported by the Ministry of Science and Higher Education of the Russian Federation, goszadanie no. 2019-1339.

Author information




All authors have contributed equally.

Corresponding author

Correspondence to Anastasia A. Funkner.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


About this article


Cite this article

Funkner, A.A., Egorov, M.P., Fokin, S.A. et al. Citywide quality of health information system through text mining of electronic health records. Appl Netw Sci 6, 53 (2021).
