Background

Medical data is a prominent class of big data, and its intelligent management and utilization have become necessary. Figure 1 shows the trends in medical systems and services, and the related issues, as medical information has changed. In the past, it was difficult to manage clinical information efficiently because most of it was generated and stored as paper charts or analogue films. To digitalize these medical data, much effort has gone into developing HIS (hospital information systems), OCS (order communication systems), PACS (picture archiving and communication systems), and EMR (electronic medical record) systems for electronic charts and digital videos/images. Since then, many studies [1–7] have aimed at efficiently storing, sharing, and transferring electronic clinical materials, and some works [8, 9] have proposed methods to analyze and process individual clinical materials such as medical images or videos.

Fig. 1 Trends of medical information utilization

Recently, the need to discover meaning in medical big data and to put it to use has grown significantly, going beyond electronically storing clinical materials or processing each item individually. In particular, as some works have tried to use medical/health data and personal life-logs for personalization services, standardization of those data has been carried out. International institutions have recognized the value of the EPR (electronic patient record), EHR (electronic health record), PHR (personal health record), health data, and UHR (universal health record). To describe those data, they have presented standards such as CCR (continuity of care record) [10], CDA (clinical document architecture) [11], and CCD (continuity of care document) [12], which provide specifications for the exchange, sharing, and integration of electronic medical and health information. These standards support the management of general information and provide XML (extensible markup language)-based schemas. However, they do not account for disease-specificity and cannot describe associative relations among data. That is, with these standards it is hard to describe the details of, and the semantic relations between, heterogeneous data related to a specific disease. Furthermore, some works have actively pursued the processing, mining, and overall management of big data in the medical and health areas, and some studies have aimed at diagnosing patients’ conditions or predicting prognoses.

However, medical data and services have particular characteristics compared with other big data areas. The purposes of general big data mining are to discover hidden meanings, to support decision-making, to recommend actions, and to predict the future by processing and analyzing data that would otherwise be treated as useless. In contrast, in medical fields, many experts are concerned about the risk of analyses or decisions made by systems rather than by human experts, and they distrust them. Therefore, rather than diagnosis or treatment decisions, efficient provision and management of medical data is more feasible and more helpful for medical experts. In other words, technologies that automate or replace the roles of physicians are less practical, and a method to manage clinical data is needed so that experts can promptly search and refer to related data. It can be helpful to provide clinical data through various views based on their semantics, importance, and relations. Therefore, in this paper, we propose a semantic convergence method that analyzes the characteristics and associations of heterogeneous medical data currently distributed across different systems.

Medical data generated in the medical fields is vast in volume. In addition, its types, formats, and attributes are heterogeneous depending on the medical department and the disease. Medical data includes common parts such as a patient’s basic profile and vital information, but clinical materials (documents, examination results, etc.) differ in modality from disease to disease. One type of material can contain data of various types, formats, properties, and significance. Besides, the kinds and formats of a given data item depend on the institution and on the medical experts involved (inspectors, physicians, and creators of materials). In this paper, we select a target disease in order to consider disease-specific characteristics. Acute myocardial infarction (AMI) is an urgent disease for which physicians must make a rapid decision about a patient’s condition and determine procedures from the patient’s information within the golden time. This target disease is well suited to showing the necessity and effects of our method for converging heterogeneous medical data.

Therefore, in this paper, we propose a method for the semantic convergence and modeling of medical information. In particular, to consider disease-specificity, we focus on AMI, since it is a critical disease that requires efficient provision of the essential parts of vast information for a quick decision within the golden time. The proposed method can extract semantic data distributed across various medical materials and unify them into one record.

Related works

In this section, we summarize previous works and their limitations, as shown in Table 1.

Table 1 Related works for describing and analysing medical information

Standards for medical data

As personal health information, including medical data, has become increasingly important, related institutions such as ASTM (American Society for Testing and Materials) and HL7 (Health Level Seven) have developed various standards for health records. The most representative standard, CCR [10], is an XML-based standard for electronically describing patients’ health information. It consists of three core components: the CCR Header, the CCR Body, and the CCR Footer. The CCR Header defines basic information (unique identifier, creation date, etc.) for documents or records, and the CCR Footer describes additional information. The core part, the CCR Body, contains significant patient-specific information such as medical problems (date, condition, etc.), family history (blood type, genetic relatives, etc.), social factors (life pattern, environmental factors, etc.), allergies, medication, and so on.

CDA [11] is a document markup standard for the structure and semantics of clinical documents. It provides a kind of template for clinical documents and comprises two parts. The CDA Header contains basic elements of a CDA document, such as its type and provider. The CDA Body specifies all the sections of the health record, such as diseases, medical procedures, plan of care, and so on.

CCD [12] is a specification resulting from a collaborative effort between ASTM and HL7. It maps the CCR elements to CDA elements. Figure 2 summarizes the CCD standard, whose elements are similar to those of CCR and CDA.

Fig. 2 The specification of CCD

These standards for medical and health information have the strength that they cover a wide range of general health information and represent it in XML-based forms. However, they cannot describe detailed data for a specific disease, and their specifications lack flexibility. Moreover, they do not consider semantic relations between data elements. In other words, with these standards it is difficult to represent specific data related to a certain disease such as AMI, or the semantic associations between heterogeneous data.

Analysis of medical data

Many studies have been conducted to generate medical data models and to semantically interpret or analyze them. In particular, some have presented data models for describing the semantic data embedded in videos/images, which are among the most important medical materials [13–15]. They proposed ontology schemas by extracting the characteristics of medical videos or images and tried to develop systems for annotation or automatic extraction of semantics. However, since they focused on only one type of clinical material, they did not consider the heterogeneous data in EMR or PACS. In recent years, some works have started to construct domain knowledge bases from the semantic relations of data stored in medical information systems such as EMR [16]. However, those works focused on automatically finding relations between symptoms and disorders. Therefore, studies on methods for efficiently providing medical data by converging heterogeneous data are still insufficient.

In studies of cardiac diseases, which are the target domain of this paper, some works have aimed at automatically processing or analyzing CT (computed tomography) or angiography videos [17, 18]. Some systems have also been developed to inform medical experts of patients’ conditions, to recommend required treatments, or to predict prognoses [19, 20]. However, these works considered a single kind of medical data and applied statistical analyses or rule-based mining to fragmentary data on patients’ current conditions. Such decision or diagnosis systems have limited practicality in real medical institutions.

Data analysis for AMI

Acute myocardial infarction (AMI), commonly known as a heart attack, occurs when the coronary arteries become blocked or narrowed due to a buildup of plaque. This can cause a significant decrease in, or a complete stop of, the blood flow to the heart. In the case of AMI, a correct diagnosis and treatment should be completed within 90 min at most, and the total time taken directly affects the prognosis and the mortality rate. Therefore, systems should be able to provide and manage the essential data required in an emergency so that medical experts can make a quick decision. However, in current HISs, the scope and kinds of data that experts can check depend on cooperation between institutions or on physicians’ capability, since the medical data for one patient is stored and managed separately by institution and by kind of data. Such management can lead to fatal results in an AMI emergency.

In this paper, to converge medical information, we collected important clinical materials related to AMI, extracted semantic data from the individual materials, and defined a single data record based on their associative relations.

First, to make our data model realistic and practical, we collected materials that are actually used at four general tertiary hospitals in Korea. We then selected eight types of commonly used materials. Our samples contain four types of images/videos, two types of reports, and two types of examination result reports.

Categorization of medical data

As mentioned in the Background section, modeling the data contained in clinical materials is necessary for the intelligent utilization of medical information. Before defining a data model, it is necessary to analyze the types and characteristics of the data that are generated and used at medical institutions. Therefore, in this section, we define the scope, kinds, and properties of the clinical data that should be included in our convergence model by classifying them according to their degree of structure.

In general, data can be categorized into three types: structured, semi-structured, and unstructured. Table 2 shows a categorization of medical data by these criteria. The classification of medical data differs from the general definitions because of the characteristics of medical fields. In general domains, most text data is classified as unstructured, but text descriptions in clinical materials can be treated as semi-structured data because they are written with regular patterns or formats. That is, medical experts usually use specific words or sentence styles when they write annotations or comments. These medical data are stored in EMR systems and PACS. However, since these systems manage the data separately, they do not consider the relations among them, and experts have to search and check individual materials manually. For example, even though clinical materials such as charts, videos generated by coronary angiography, and examination reports for a patient contain similar information, they are stored in different systems and managed by patient ID without considering semantic relations.

Table 2 Structure-based categorization of medical data

Analysis of AMI-related data

To reflect disease-specificity in the data model, we collected and analyzed eight clinical materials that are essential for the diagnosis and treatment of AMI. As shown in Table 3, they fall into four types of material, and a single material can contain data of various formats.

Table 3 AMI-related clinical materials
  • System input: materials generated by entering data in fixed fields in the electronic systems of hospitals. Most of them contain structured data.

  • Document: various reports which include both structured data and semi-structured texts.

  • Image/video: images or videos resulting from medical examinations. Images or videos themselves are unstructured data, but they also include structured data such as measurement values and their metadata.

  • Chart: analogue or digital charts in which medical experts write or draw data by hand or by using systems. These charts can contain all types of data.
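The four material types and the data categories each one typically carries can be written down as a small lookup table. The following is a minimal sketch, assuming only what the list above states; the names `Structure`, `MaterialType`, `TYPICAL_CONTENTS`, and `materials_containing` are illustrative and are not part of the proposed model.

```python
from enum import Enum


class Structure(Enum):
    """Degree of structure of a data item."""
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


class MaterialType(Enum):
    """The four types of AMI-related clinical material."""
    SYSTEM_INPUT = "system input"
    DOCUMENT = "document"
    IMAGE_VIDEO = "image/video"
    CHART = "chart"


# Data categories each material type typically carries, following the list above.
# "Most" system input data is structured, so this table is a simplification.
TYPICAL_CONTENTS = {
    MaterialType.SYSTEM_INPUT: {Structure.STRUCTURED},
    MaterialType.DOCUMENT: {Structure.STRUCTURED, Structure.SEMI_STRUCTURED},
    MaterialType.IMAGE_VIDEO: {Structure.STRUCTURED, Structure.UNSTRUCTURED},
    MaterialType.CHART: set(Structure),  # charts can hold all three categories
}


def materials_containing(structure):
    """Return the material types that typically carry data of the given category."""
    return {m for m, contents in TYPICAL_CONTENTS.items() if structure in contents}
```

For example, `materials_containing(Structure.SEMI_STRUCTURED)` returns the document and chart types, which are the materials where pattern-based text extraction would be relevant.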

Among these materials, we focus on the three types shaded gray in Table 3 and analyze each material to extract and unify its data based on semantic relations. These materials are related to CAG (coronary angiography), which is an essential medical examination for the diagnosis and treatment of AMI.

The first clinical material, shown in Fig. 3, is a coronary angiography video. Although the sample appears as captured images in Fig. 3, it is a video produced by a medical examination. Two groups of data can be extracted from this material, as shown in Fig. 4. Metadata comprises properties of the video, such as the examination date and angles, which are attached automatically when it is created. Semantic data is information that medical experts are expected to interpret. In other words, this group includes the location and condition of a lesion in terms of the main coronary arteries and their segments or branches.

Fig. 3 Samples of CAG video

Fig. 4 Analysis of data in CAG video

The second material is a coronary angiography report. It is a document that contains not only structured data, such as a patient’s information (age, gender, etc.) and the date, but also semi-structured data. A large portion of this report is text, but it is written in typical patterns that mention a patient’s medical history, smoking habit, and symptoms. To make the data model general, we created a template by extracting the common factors from the reports of four institutions. Figure 5 shows a sample template and Table 4 shows the data extracted from the report.
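Because the report text follows typical patterns, simple pattern matching is one way such semi-structured fields could be pulled out. The sketch below is only an illustration under strong assumptions: the line labels (`History:`, `Smoking:`, `Symptom:`) and the function name `extract_fields` are hypothetical, and real report templates differ between institutions.

```python
import re

# Hypothetical line labels; these are assumptions, not the actual report template.
FLAGS = re.IGNORECASE | re.MULTILINE
PATTERNS = {
    "medical_history": re.compile(r"^History[ \t]*:[ \t]*(.+)$", FLAGS),
    "smoking": re.compile(r"^Smoking[ \t]*:[ \t]*(.+)$", FLAGS),
    "symptoms": re.compile(r"^Symptoms?[ \t]*:[ \t]*(.+)$", FLAGS),
}


def extract_fields(report_text):
    """Return the fields whose pattern matches, with surrounding whitespace stripped."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report_text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


# Invented sample text, used only to exercise the patterns.
sample = """History: hypertension, diabetes
Smoking: current smoker
Symptom: chest pain for 2 hours"""
fields = extract_fields(sample)
```

Fields whose label is absent from the text are simply omitted from the result, so a caller can tell a missing field from an empty one.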

Fig. 5 A sample template of CAG report

Table 4 Analysis of data in CAG report

The final material is a coronary arteriogram, shown in Fig. 6, which can describe essential information about diagnoses, treatment procedures, and progress. In many hospitals, physicians use this material to mark the specific site and severity of lesions on a simplified map of the coronary arteries related to AMI. This material contains the significant data depicted in Fig. 7.

Fig. 6 A sample of coronary arteriogram

Fig. 7 Analysis of data in coronary arteriogram

Data convergence modeling for AMI

As the analysis of the three materials in the previous section shows, they contain heterogeneous and redundant data. In addition, even though those data are semantically related to each other, they are distributed across different materials. That is, as shown in Fig. 8, information related to AMI is dispersed over three kinds of materials, while some of the data are duplicated and semantically associated with each other. Therefore, a new method is needed to semantically converge and manage the medical data themselves, rather than clinical materials such as documents and images/videos. Table 5 shows the proposed data model for AMI, defined by converging the three types of CAG-related materials. That is, the specification can describe data that are stored in a video, a document, and an image in current HISs.
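The idea of converging duplicated data from several materials can be illustrated with a short sketch. It assumes that extraction has already produced (material, element, value) triples; the function name `converge`, the element names, and the sample values are hypothetical and do not reproduce the elements of Table 5.

```python
from collections import defaultdict


def converge(extracted):
    """Merge (material, element, value) triples into one record.

    Each element keeps one entry per distinct value, together with the
    materials that reported that value, so duplicates collapse and
    disagreements between materials stay visible.
    """
    record = defaultdict(dict)
    for material, element, value in extracted:
        record[element].setdefault(value, []).append(material)
    return dict(record)


# Hypothetical extraction results; element names and values are illustrative only.
triples = [
    ("CAG video", "lesion_location", "LAD proximal"),
    ("CAG report", "lesion_location", "LAD proximal"),
    ("coronary arteriogram", "lesion_location", "LAD proximal"),
    ("CAG report", "smoking", "current smoker"),
]
record = converge(triples)
```

Here the lesion location reported by all three materials is stored once, with the three source materials kept as references rather than as separate copies of the value.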

Fig. 8 Convergence of CAG-related data

Table 5 A convergence model of three types of CAG-related data for AMI

The data record includes essential elements about the diagnosis and treatment and consists of four groups as follows:

  • Patient: a patient’s basic information, such as personal information and coronary anatomical information.

  • Physical history: a patient’s states or habits which have a strong influence on AMI.

  • Vital history: a patient’s basic medical states related to AMI.

  • Medical history: information about past diagnoses or treatments related to AMI. That is, this part includes the locations and states of lesions, the disease names from past diagnoses, and the results of treatments.

In current EMR systems and PACS, these mutually associated data are contained in heterogeneous materials. In contrast, the convergence model shown in Table 5 can provide and manage the essential information as a single record. Moreover, each data element is described with the following properties, shown in Table 6:

Table 6 Properties for elements in a record
  • Creator: the medical experts (physicians, inspectors, etc.) or institutions who created and modified each element.

  • Date: the date on which creators created and changed the value of each element.

  • Importance: the level of significance of each element. This property can have grades depending on whether the element is essential for the diagnosis and treatment, or on the degree to which it influenced AMI.

  • Reference-metadata: metadata for the ‘reference’ elements of the ‘Medical History’ group. This property represents metadata of related materials (documents, images/videos, etc.), such as equipment models and angles of examinations.

The proposed data model shown in Table 5 can serve as a foundation for efficiently managing and promptly providing the information that is crucial for decisions on diagnosis and treatment, particularly in the case of urgent diseases.
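To make the structure of the record concrete, the four groups and the per-element properties can be sketched as plain data classes. This is a minimal sketch, not the specification in Table 5: the class and field names are illustrative, and representing importance as an integer grade (higher meaning more essential) is an assumption.

```python
import datetime
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Element:
    """One data element with the properties described for Table 6."""
    value: Any
    creator: str             # medical expert or institution
    date: datetime.date      # when the value was created or changed
    importance: int          # assumed integer grade; higher means more essential
    # Only meaningful for 'reference' elements of the medical history group.
    reference_metadata: Optional[Dict[str, str]] = None


@dataclass
class ConvergenceRecord:
    """A single record holding the four groups of the convergence model."""
    patient: Dict[str, Element] = field(default_factory=dict)
    physical_history: Dict[str, Element] = field(default_factory=dict)
    vital_history: Dict[str, Element] = field(default_factory=dict)
    medical_history: Dict[str, Element] = field(default_factory=dict)

    def essential(self, min_importance):
        """List (group, element name) pairs graded at or above min_importance."""
        groups = {
            "patient": self.patient,
            "physical_history": self.physical_history,
            "vital_history": self.vital_history,
            "medical_history": self.medical_history,
        }
        return [
            (group, name)
            for group, elements in groups.items()
            for name, element in elements.items()
            if element.importance >= min_importance
        ]
```

A view such as `record.essential(2)` corresponds to the use case described above, where only the elements crucial for a quick decision are presented first.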

Conclusion

Data generated in medical fields are vast in volume and heterogeneous in terms of types, formats, and characteristics. However, current HISs simply store these data in separate documents and images/videos. It is therefore difficult to provide relevant data efficiently within the golden time, when experts need to view essential data to make decisions about urgent diseases such as AMI. To manage these data usefully, a new data model should be defined based on an analysis of the characteristics and semantic relations of the data included in heterogeneous materials. In this paper, we proposed a convergence model that specifies the data essential for the diagnosis of AMI by analyzing three materials related to CAG. In contrast with current HISs, the proposed model can unify the semantic data contained in various materials into a single record. The convergence record will enable medical experts to search important data easily and intuitively for quick decisions about diagnoses and treatments.

Starting from the convergence model proposed in this paper, we plan to address the following challenges in future work:

  • Improvement of the convergence record: to enhance the coverage and completeness of our data model, we will extend and refine its elements by analyzing all of the essential materials.

  • Verification: it is necessary to examine, with medical experts, the feasibility and applicability of the data model to HISs.

  • Implementation and experiment: we will implement a medical data provision and management system that applies the model, and will qualitatively and quantitatively compare its efficiency with that of current HISs.