The Requirements For Ontologies in Medical Data Integration: A Case Study
Abstract
Evidence-based medicine is critically dependent on three sources of information: a
medical knowledge base, the patient's medical record and knowledge of available
resources, including, where appropriate, clinical protocols. Patient data is often scattered
in a variety of databases and may, in a distributed model, be held across several disparate
repositories. Consequently, addressing the needs of an evidence-based medicine community
presents issues of biomedical data integration, clinical interpretation and knowledge
management. This paper outlines how the Health-e-Child project has approached the
challenge of requirements specification for (bio-) medical data integration, from the level of
cellular data, through disease to that of patient and population. The approach is
illuminated through the requirements elicitation and analysis of Juvenile Idiopathic
Arthritis (JIA), one of three diseases being studied in the EC-funded Health-e-Child project.
The diagnosis is clinical, mostly negative, while the origin is probably molecular.
Assessments are based on imaging, while the treatment is local (molecular, organ), as
well as global (molecular). How do these interact?
How can the clinical variables for disease activity and damage be standardised?
Correlations can be made between patient assessment and disease progression. Which
are the early predictors of damage?
What are the short and long term effects of treatment procedures?
Are there markers which distinguish between reversible and permanent organ damage?
Can we identify more homogeneous groups of patients for better classification of JIA
subtypes and better planning for treatment response?
Are there any hereditary elements, and do these correlate with other autoimmune diseases?
Knowledge Discovery (KD) algorithms using statistical learning and data mining
techniques to find a better classification of the JIA (homogeneous groups of patients)
based on the joint analyses of all available heterogeneous biomedical data (e.g.
clinical, imaging, genomics, proteomics). Currently, JIA diagnosis and treatment
are primarily based on clinical assessment. The imaging and laboratory tests assist
clinicians in the evaluation of disease progression and drug outcome but do not
serve as early predictors or indicate the patient's sensitivity to a particular treatment or
risk of poor outcome.
Image-based techniques including semi-automatic segmentation of the synovitis to
speed up and improve the scoring (an inflamed synovia might be an early predictor of
the severity of the disease); semi-automatic evaluation of joint damage; image
registration across time to be able to compare the different studies of the same patient
or between different (groups of) patients.
Decision Support System (DSS) for individualized evaluation/treatment by monitoring
disease progression and treatment outcome based on available biomedical inputs and
previously established clinical knowledge.
1. Identify those MRI baseline measures (degree of synovitis, bone marrow oedema etc.)
that are most predictive of future severe radiological damage. (Correlation of the MRI
baseline variables with the change in the erosion score from baseline to 1-2 years.)
2. Identify more homogeneous groups of patients (suitable for aetiopathogenetic studies),
taking into account clinical assessment and immunologic, genetic, proteomic and radiological
findings.
3. Identify a panel of candidate protein biomarkers in JIA that can predict which patients
will develop erosive, disabling disease. (Are the serum and synovial protein profiles
reflective of active disease and/or predictive of disease course?)
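The first objective can be read as ranking baseline variables by how strongly they are associated with the later change in erosion score. The following is a minimal sketch of that idea, assuming Spearman rank correlation as the measure of association (the text does not name one). The variable names and all numeric values are invented toy data for illustration only and are not project results.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rank_predictors(baseline, erosion_change):
    """Sort baseline variables by absolute correlation with erosion change."""
    scores = {name: spearman(vals, erosion_change)
              for name, vals in baseline.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Invented toy values, one entry per (hypothetical) patient.
baseline = {
    "synovitis_score": [1, 2, 3, 4, 5],
    "bone_marrow_oedema": [2, 1, 4, 3, 5],
}
erosion_change = [0, 1, 2, 3, 4]

for name, rho in rank_predictors(baseline, erosion_change):
    print(f"{name}: rho = {rho:.2f}")
```

A real analysis would also need significance testing and adjustment for confounders such as age and treatment, which this sketch leaves out.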
All these examples stress the importance of integrating the diverse medical data
(clinical, epidemiological, imaging, genomic, proteomic etc.) that represent the patient's
information at different levels of granularity (vertical levels). The semantic link between
these levels is obvious: entities are in a part-of hierarchy, but the extreme complexity of the
human body and its processes usually do not allow for drawing straightforward conclusions
from parts to the whole. To establish a basis for the semantic coherence of the integrated
data and facilitate availability and accessibility of external information for querying and
analysis by clinicians, the mappings from clinical data to the external medical knowledge
(e.g. biomedical ontologies and databases) should be provided. For example, to answer the
third question the clinical data needs to be aligned with the external knowledge sources to
identify genetic markers that can be present in JIA.
Another important requirement for our modelling is the temporal dimension. Time is a
key issue in paediatric research and practice. Clinicians are usually interested in analysing
patients' data over time (see the first example). The paediatrics domain adds further
complexity because the child is growing and the observations in time should be
aligned with the anatomical changes as well as the knowledge about how a particular
disease may affect these changes. The clinical process usually follows a given time order
(symptoms, study, diagnosis, treatment, follow-up, etc.). In addition, some
symptom/diagnosis and treatment concepts are time-related (for example, a diagnostic
criterion for JIA is persistence of some symptoms for a given time; medication is prescribed
with a time profile, etc.).
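A time-related criterion such as "persistence of some symptoms for a given time" can be checked mechanically once observations carry dates. The sketch below assumes, purely for illustration, that observations are (date, symptom present) pairs and that the required duration is supplied by the caller. The function names and the 42-day and 60-day thresholds in the usage lines are hypothetical and are not taken from this paper.

```python
from datetime import date, timedelta

def longest_persistence(observations):
    """Longest time span covered by an unbroken run of positive observations.

    observations: iterable of (date, bool) pairs, in any order.
    A negative observation ends the current run.
    """
    longest = timedelta(0)
    run_start = None
    for day, present in sorted(observations):
        if present:
            if run_start is None:
                run_start = day
            longest = max(longest, day - run_start)
        else:
            run_start = None
    return longest

def meets_persistence_criterion(observations, minimum):
    """True if the symptom persisted for at least `minimum` (a timedelta)."""
    return longest_persistence(observations) >= minimum

# Invented visit history for one hypothetical patient.
visits = [
    (date(2007, 1, 1), True),
    (date(2007, 1, 20), True),
    (date(2007, 2, 15), True),
    (date(2007, 3, 1), False),
]

print(longest_persistence(visits).days)
print(meets_persistence_criterion(visits, timedelta(days=42)))
```

Note that the sketch treats the symptom as present between two consecutive positive observations, which is itself a modelling assumption; sparse follow-up visits may not justify it.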
4. Ontologies in Health-e-Child
The standard textbook definition of an ontology is a formal specification of a shared
conceptualisation. This means that an ontology represents a shared, agreed and detailed
model of a problem domain. We are currently developing and investigating an ontological
approach to represent the HeC domain. One advantage of using ontologies is their
ability to help resolve semantic heterogeneity that is present within the data. Ontologies
define links between different types of semantic knowledge. They can particularly aid in the
resolution of terms for queries and other general search strategies, thus improving the
search results that are presented to clinicians. The fact that ontologies are machine
processable and human understandable is especially useful in this regard [8]. A complete
discussion of ontologies is beyond the scope of this paper; the interested reader is referred
to [9]. There are many ontologies in existence today, especially in the biomedical domain;
however, they are often limited to one level of what we refer to as vertical integration. For
example, consider the Gene Ontology (GO) [6], which only defines structures regarding
genes, and GALEN [5], which is limited to anatomical concepts. In both cases there are no
links to the other vertical levels that we have defined. We are currently investigating the
scope for reusing these ontologies, or parts thereof, which have been identified by experts in
both knowledge representation and clinical matters.
Many of the ontologies that exist today do not cover the paediatrics domain
thoroughly. For example, the physiology of a fully grown adult differs from that of a
child, although there are also similarities (both have two lungs, for instance).
Hence, it would not be sensible to reuse these ontologies in their entirety; instead we
propose the extraction of the relevant parts and then the integration of these into a coherent
whole, thereby capturing most of the HeC domain. However, it should be noted that
integrating these ontologies into one single (upper level) ontology will not be sufficient to
capture the entire HeC domain, and therefore we will have to model the missing attributes
and extend these existing ontologies to suit our needs. Although there are other upper level
ontologies present today, such as DOLCE [10] and SUMO [11], they are considered to be
too broad to be included in the project.
The ontology modelling process is known as ontology engineering. The traditional
ontology engineering process is an iterative process consisting of ontology modelling and
ontology validation [12]. Taking this view of ontology engineering, we have chosen to
evaluate different methodologies available to us for the development of our vertical domain
model. There are many methods available in the literature, for example CommonKADS [13]
and DILIGENT [14]; this evaluation process is ongoing. A methodology that deserves special
consideration in this paper is that of Seidenberg and Rector [15], who propose a strategy
for the modular development of ontologies to support the re-use, maintainability
and evolution of the ontology to be developed. This methodology consists of untangling the
ontology into disjoint independent trees which can be recombined into an ontology using
definitions and axioms to represent the relationships in an explicit fashion. To facilitate this
modular methodology we are also considering the use of a fragment-based
approach for the development of our domain models.
Data integration is the process of using a conceptual representation of the data and of
their relationships to eliminate possible heterogeneities. Ontologies are extensively used in
data integration systems because they provide an explicit and machine-understandable
conceptualization of a domain. Wache et al. [16] describe several approaches to
ontology-based data integration, which we now consider in further detail. In the
single ontology approach, all source schemas are directly related to a shared global
ontology that provides a uniform interface to the user. However, this approach requires that
all sources have nearly the same view on a domain, with the same level of granularity. A
typical example of a system using this approach is SIMS [17]. In the multiple ontology
approach, each data source is described by its own (local) ontology separately. Instead of
using a common ontology, local ontologies are mapped to each other. For this purpose, an
additional representation formalism is necessary for defining the inter-ontology mappings.
The OBSERVER system [18] is an example of this approach. The hybrid ontology
approach combines the two preceding approaches: a local ontology is built for each
source schema and is mapped not to other local ontologies but to a global shared
ontology. New sources can be added with no need for
modifying existing mappings. The layered framework [19] is an example of this approach.
The single and hybrid approaches (see Figure 2) are appropriate for building central data
integration systems, the former being more appropriate for so-called Global-as-View (GaV)
systems and the latter for Local-as-View (LaV) systems. One drawback associated with the
single global approach is the need for maintenance when new information sources are added
to the representation. The hybrid architecture allows for greater flexibility in this regard
with new sources being represented at the local level. The multiple ontology approach can
be best used to construct pure peer-to-peer data integration systems, where there are no
super-peers.
Figure 2: The three possible ways for using ontologies for data integration [16]
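The key property of the hybrid approach, that a new source brings only its own local-to-global mapping and leaves existing mappings untouched, can be illustrated with a small sketch. Everything here is hypothetical: the class name, the concept names and the toy records are assumptions made for illustration and do not describe the HeC implementation or any of the cited systems. Local ontologies are reduced to plain term-to-concept dictionaries.

```python
class HybridIntegrator:
    """Toy hybrid integration: local terms map only to shared global concepts."""

    def __init__(self, global_concepts):
        self.global_concepts = set(global_concepts)
        self.sources = {}  # source name -> (local-to-global mapping, records)

    def add_source(self, name, local_to_global, records):
        """Register a source; no existing mapping is read or modified."""
        unknown = set(local_to_global.values()) - self.global_concepts
        if unknown:
            raise ValueError(f"unknown global concepts: {sorted(unknown)}")
        self.sources[name] = (dict(local_to_global), list(records))

    def query(self, concept):
        """Answer a query posed in global terms by rewriting it per source."""
        results = []
        for name, (mapping, records) in self.sources.items():
            local_terms = [t for t, g in mapping.items() if g == concept]
            for record in records:
                for term in local_terms:
                    if term in record:
                        results.append((name, record[term]))
        return results

# Invented sources that name the same concept differently.
integrator = HybridIntegrator({"Diagnosis", "ActiveJointCount"})
integrator.add_source(
    "clinic_a",
    {"dx": "Diagnosis", "joints": "ActiveJointCount"},
    [{"dx": "JIA", "joints": 4}],
)
integrator.add_source(
    "clinic_b",
    {"diagnose": "Diagnosis"},
    [{"diagnose": "JIA"}],
)

print(integrator.query("Diagnosis"))
```

In the multiple ontology approach, by contrast, adding `clinic_b` would require a mapping to every existing local ontology rather than a single mapping to the shared one.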
An interesting research area within the scope of the HeC project is the exposure of the
HeC knowledge base to the outside world. One example of where this can be useful is in
supporting communication with outside sources, so that the HeC system can be queried by
a user whose queries are expressed in terms of an external ontology. For example, if a
common concept is found between the HeC ontology and an external ontology such as the
FMA (Foundational Model of Anatomy) [7], then a query that was originally formulated for
the FMA ontology can also be processed within the HeC domain via the common links that
are found between the two ontologies. Much previous work has already been conducted on
constructing (semi-)automatic mappings between ontologies, referred to as ontology alignment. The
HeC project is currently investigating existing methods and creating new approaches to
facilitate the alignment of the HeC ontology with external ontologies. This will facilitate the
knowledge sharing of the HeC domain via the ontology and will aid in its reuse.
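As a rough illustration of how an alignment lets a query posed against an external ontology be processed locally, the sketch below proposes candidate mappings by matching normalised concept labels and then translates query terms through the resulting alignment. Label matching is only one simple, assumed alignment heuristic and is not the method under investigation in the project. All identifiers (`ext:1`, `hec:knee` and so on) and labels are hypothetical.

```python
def normalise(label):
    """Lower-case a label and unify underscores and runs of whitespace."""
    return " ".join(label.replace("_", " ").lower().split())

def align_by_label(external, local):
    """Propose external-id -> local-id mappings where normalised labels match.

    external, local: dicts mapping concept id -> human-readable label.
    """
    index = {normalise(label): cid for cid, label in local.items()}
    return {
        ext_id: index[normalise(label)]
        for ext_id, label in external.items()
        if normalise(label) in index
    }

def translate_query(terms, alignment):
    """Split query terms into translated local ids and unmapped leftovers."""
    translated = [alignment[t] for t in terms if t in alignment]
    unmapped = [t for t in terms if t not in alignment]
    return translated, unmapped

# Invented concept tables standing in for an external and a local ontology.
external = {
    "ext:1": "Knee joint",
    "ext:2": "Synovial membrane",
    "ext:3": "Left ventricle",
}
local = {
    "hec:knee": "knee_joint",
    "hec:synovium": "Synovial  Membrane",
}

alignment = align_by_label(external, local)
print(alignment)
print(translate_query(["ext:1", "ext:3"], alignment))
```

Terms that come back as unmapped are exactly the cases where a human expert or a more sophisticated alignment method would be needed, which is why such mappings are described as semi-automatic.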
Another interesting feature of ontologies is that they can aid in the creation of similarity
metrics. This has already been attempted by many projects, for example in [26] and [27] to
gauge the similarity between genes using the Gene Ontology, and by Resnik in [28] to gauge
the similarity between different words with the WordNet thesaurus. This technique can aid
in the integration of the other sub-projects in HeC, such as the decision support system:
a similarity metric based on the HeC ontology would provide a common base for
the training and classification phases of the DSS. Another area where the HeC ontology
can be used within the project is the creation of ontology-based training data, for example to
classify different diseases; this can be done by using the rule base of the ontology within
an expert system [29]. The HeC ontology can also be used to annotate different data sets, such as
images, for easy access later, hence creating a semantic image database.
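To make the idea of an ontology-based similarity metric concrete, the sketch below computes an information-content similarity in the style associated with Resnik's work: two concepts are as similar as their most informative shared ancestor, where informativeness is the negative log of the probability of encountering that ancestor or any of its descendants. The taxonomy, concept names and annotation counts are invented for illustration and are not part of the HeC ontology.

```python
import math

def ancestors(concept, parents):
    """Return the concept itself plus all of its is-a ancestors."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def concept_probabilities(parents, counts):
    """p(c): share of annotations that fall on c or on any descendant of c."""
    totals = {}
    for concept, n in counts.items():
        for a in ancestors(concept, parents):
            totals[a] = totals.get(a, 0) + n
    grand_total = sum(counts.values())
    return {c: t / grand_total for c, t in totals.items()}

def resnik_similarity(a, b, parents, probs):
    """Information content of the most informative common ancestor."""
    common = ancestors(a, parents) & ancestors(b, parents)
    contents = [-math.log(probs[c]) for c in common if probs.get(c, 0) > 0]
    return max(contents, default=0.0)

# Invented is-a hierarchy (child -> parents) and annotation counts.
parents = {
    "arthritis": ["joint_disease"],
    "synovitis": ["joint_disease"],
    "joint_disease": ["disease"],
    "brain_tumour": ["disease"],
}
counts = {"arthritis": 2, "synovitis": 2, "brain_tumour": 4}
probs = concept_probabilities(parents, counts)

print(resnik_similarity("arthritis", "synovitis", parents, probs))
print(resnik_similarity("arthritis", "brain_tumour", parents, probs))
```

Concepts that share only the root score zero, because the root covers every annotation and therefore carries no information. A metric of this kind could be plugged into a classifier as a distance between annotated cases, which is one way of reading the "common base" for training and classification mentioned above.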
5. Conclusions
In this paper we have used JIA as one disease to illustrate concretely the kind of
medical problems we are trying to solve in the HeC project. Many of those are not
JIA-specific but appear in other areas of medicine, possibly with different weights of relevance,
importance etc. The Health-e-Child project aims to provide generic solutions without
focusing on one particular study. We have selected three considerably different disease
areas (paediatric heart diseases, Juvenile Idiopathic Arthritis (JIA) and child brain tumours)
to investigate the problems related to differences and commonalities across the paediatric
domain. Clinical requirements have been collected during the elicitation sessions with the
medical experts and these requirements have been driving the development of the
technological solutions to tackle these problems. The integration of the diverse medical data
(clinical, epidemiological, imaging, genomic, proteomic etc.) that represent the patient's
information at different levels of granularity is very important, as the clinical knowledge will
span different medical disciplines, allowing clinicians to discover interesting findings
and infer new medical knowledge. In addition, the clinical workflows can be quite different
in different medical areas (as is exemplified by the three different diseases in our project),
but the patient journey can be viewed as the composition of similar tasks (e.g. baseline,
diagnosis, treatment, follow-up etc.) for which a common model based on the reusable
formalized process patterns should be used. As indicated, future work in the project will
enable appropriate knowledge representations including ontologies to be implemented to
aid the process of vertical data integration and address the differences across these three
disease domains.
6. References
[1] The Information Societies Technology Project: Health-e-Child EU Contract IST-2004-027749.
[2] Stevens et al., Using OWL to model biological knowledge. International Journal of Human Computer
Studies, 2006.
[3] Baker et al., An ontology for bioinformatics applications, Bioinformatics, 15(6):510-520, 1999.
[4] Paediatric Rheumatology: http://www.printo.it/pediatric-rheumatology/, Accessed 4/4/2007.
[5] Rector, A. L., Rogers, J. E., Zanstra, P. E., Van Der Haring, E., OpenGALEN: Open source medical
terminology and tools. In Proceedings of the American Medical Informatics Association Symposium 2003,
pages 982-985.
[6] Ashburner M. et al., Gene Ontology: tool for the unification of biology. Nature Genetics 2000, Vol 25:
pages 25-29.
[14] Christoph Tempich, H. Sofia Pinto, York Sure and Steffen Staab, An argumentation ontology for
DIstributed, Loosely-controlled and evolvInG Engineering processes of oNTologies (DILIGENT), Lecture
Notes in Computer Science, Volume 3532/2005, 241-256.
[15] Seidenberg, J. and Rector, A. 2006. Web ontology segmentation: analysis, classification and use. In
Proceedings of the 15th international Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26,
2006). WWW '06. ACM Press, New York, NY, 13-22.
[16] H. Wache et al., Ontology-Based Integration of Information: A Survey of Existing Approaches, IJCAI
Workshop on Ontologies and Information Sharing, 2001.
[17] Arens, Y., Knoblock, C. A., Shen, W. M., Query Reformulation for Dynamic Information Integration,
Journal of Intelligent Information Systems, 1996, Vol 6, Number 2, pages 99-130, Kluwer Academic
Publishers, Netherlands.
[18] Mena, E., Illarramendi, A., Kashyap, V., Sheth, A. P., OBSERVER: An Approach for Query Processing in
Global Information Systems Based on Interoperation Across Pre-Existing Ontologies, Distributed and Parallel
Databases, 2000, Vol 8, Part 2, pages 223-271, Kluwer Academic Publishers, Netherlands.
[19] Cruz, I. F., Huiyong Xiao, Using a layered approach for interoperability on the semantic Web, Web
Information Systems Engineering, 2003. WISE 2003, pages 221-231.
[21] N. F. Noy, M. A. Musen, The PROMPT suite: interactive tools for ontology merging and mapping,
International Journal of Human-Computer Studies, 2003, Elsevier, Vol 59, pages 983-1024.
[22] G. Stumme, A. Madche, FCA-Merge: Bottom-Up Merging of Ontologies, Proc. 17th Intl. Conf. on
Artificial Intelligence (IJCAI 01), Seattle, WA, USA, 2001, pages 225-230.
[24] Blackburn, P., Seligman, J., Hybrid Languages, Journal of Logic, Language and Information, 1995, Vol
4, pages 251-272, Kluwer Academic Publishers.
[25] Bodenreider, O., The Unified Medical Language System (UMLS): integrating biomedical terminology,
Nucleic Acids Research, 2004, Vol 32, Number 1, Supp, pages 267-270, Oxford University Press.
[26] Amandeep S. Sidhu, Tharam S. Dillon, Elizabeth Chang, Advances in Protein Ontology Project,
19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06), 2006, pages 588-592.
[27] Sevilla, J. L., Segura, V., Podhorski, A., Guruceaga, E., Mato, J. M., Martinez-Cruz, L. A., Corrales, F. J.,
and Rubio, A. 2005. Correlation between Gene Expression and GO Semantic Similarity. IEEE/ACM Trans.
Comput. Biol. Bioinformatics 2, 4 (Oct. 2005), 330-338.
[28] Philip Resnik, Disambiguating noun groupings with respect to WordNet senses, In Third Workshop on
Very Large Corpora. Association for Computational Linguistics, 1995.
[29] JESS, the Rule Engine for the Java™ Platform, http://herzberg.ca.sandia.gov/jess/, Accessed 2/4/2007.