Abstract
Vast amounts of medical information are still recorded as unstructured text. The knowledge contained in this textual data has great potential to improve routine clinical care, to support clinical research, and to advance the personalization of medicine. To access this knowledge, the underlying data has to be semantically integrated, an essential prerequisite for which is information extraction from clinical documents.
A body of work and a good selection of openly available tools for information extraction and semantic integration in the medical domain exist, yet almost exclusively for English-language documents. For German texts the situation is rather different: research work is sparse, tools are proprietary or unpublished, and hardly any freely available textual resources exist. In this survey, we (1) describe the challenges of information extraction from German medical documents and the hurdles posed to research in this area, (2) address in particular the problems of missing German-language resources and privacy implications, and (3) identify the steps necessary to overcome these hurdles and fuel research in semantic integration of textual clinical data.
About the authors
Johannes Starlinger is a Research Associate at the Department of Computer Science at Humboldt-Universität zu Berlin. After studying medicine at the Medical University of Vienna and computer science at HU Berlin, he joined the DFG-funded SOAMED graduate program in 2010 to research service-oriented architectures in a medical area of application. He received his PhD from HU Berlin in 2015. Johannes' current research focuses on similarity search over data relevant to the biomedical domain, including scientific workflows, genomic data, and medical data.
Humboldt-Universität zu Berlin, Institut für Informatik, Unter den Linden 6, 10099 Berlin, Germany
Madeleine Kittner studied chemistry at TU Berlin and the University of Strathclyde, Glasgow. In 2011, she received a PhD in biochemistry from Universität Potsdam, Germany. She has experience in analyzing transcriptomics data and signaling pathways, and in text mining of Dutch medical records. Currently, she is a research associate at the Department of Computer Science at Humboldt-Universität zu Berlin, focusing on text mining of biomedical documents.
Humboldt-Universität zu Berlin, Institut für Informatik, Unter den Linden 6, 10099 Berlin, Germany
Oliver Blankenstein is a pediatrician at the Department of Pediatric Endocrinology and Diabetology and the head of the Newborn Screening Laboratory at Charité Universitätsmedizin Berlin. He also heads the department of endocrinology and metabolism at Labor Berlin. He has served as PI in a number of randomized controlled clinical trials and, for several years, has been advising computer science researchers in the DFG RTG SOAMED.
Charité Universitätsmedizin Berlin, Pädiatrische Endokrinologie und Diabetologie, Augustenburger Platz 1, 13353 Berlin, Germany
Ulf Leser studied informatics at TU München and obtained his PhD from TU Berlin. In 2002 he became professor for Knowledge Management in Bioinformatics at HU Berlin. His highly interdisciplinary research focuses on scientific data management, statistical bioinformatics, biomedical text mining, and scientific workflows. He is the speaker of the DFG-RTG SOAMED and a board member of the DFG-excellence RTG BSIO.
Humboldt-Universität zu Berlin, Institut für Informatik, Unter den Linden 6, 10099 Berlin, Germany
Acknowledgement
This work was partly funded by BMBF grants PERSONS (031L0030B) and PREDICT (031L0023A), and by DFG grant SOAMED (GRK1651).
©2016 Walter de Gruyter Berlin/Boston