A lexical approach to identifying subtype inconsistencies in biomedical terminologies

R Abeysinghe, F Zheng, EW Hinderer… - 2018 IEEE …, 2018 - ieeexplore.ieee.org
2018 IEEE International Conference on Bioinformatics and …, 2018 - ieeexplore.ieee.org
We introduce a lexical-based inference approach for identifying subtype (is-a relation) inconsistencies in biomedical terminologies. Given a terminology, we first represent the name of each concept as a sequence of words. We then generate hierarchically linked and unlinked pairs of concepts, such that the two concepts in a pair have the same number of words, share at least one word, and differ in a fixed number n of words (n = 1, 2, 3, 4, 5). The linked and unlinked concept-pairs are in turn used to infer corresponding linked and unlinked term-pairs, respectively. If a linked concept-pair and an unlinked concept-pair infer the same term-pair, we consider this a potential subtype inconsistency, which may indicate a missing subtype relation or an incorrect one. We applied this approach to the Gene Ontology (GO), the National Cancer Institute Thesaurus (NCIt), and SNOMED CT. A total of 4,841 potential subtype inconsistencies were found in GO, 2,677 in NCIt, and 53,782 in SNOMED CT. Domain experts evaluated a random sample of 211 potential inconsistencies in GO and verified that 124 of them are valid (i.e., a precision of 58.77% for detecting subtype inconsistencies in GO). We also performed a preliminary study on the extent to which external knowledge in the Unified Medical Language System (UMLS) can provide supporting evidence for validating the detected potential inconsistencies: 0.54% (=26/4841) for GO, 11.43% (=306/2677) for NCIt, and 3.61% (=1940/53782) for SNOMED CT. The results indicate that our lexical-based inference approach is a promising way to identify subtype inconsistencies and can facilitate the quality improvement of biomedical terminologies.
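The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say exactly how a term-pair is inferred from a concept-pair, so the sketch assumes the term-pair consists of the words left over after removing the shared words from each name. The function names (`infer_term_pair`, `find_potential_inconsistencies`), the toy concept IDs, and the choice to record unlinked pairs in both directions are all hypothetical.

```python
from itertools import combinations


def infer_term_pair(name_a, name_b, n):
    """Return the (term_a, term_b) pair inferred from two concept names,
    or None if the names do not qualify.

    Assumption (not stated in the abstract): the term-pair is made of the
    words that remain in each name once the shared words are removed.
    """
    a, b = name_a.lower().split(), name_b.lower().split()
    if len(a) != len(b):           # names must have the same number of words
        return None
    common = set(a) & set(b)
    if not common:                 # names must share at least one word
        return None
    diff_a = [w for w in a if w not in common]
    diff_b = [w for w in b if w not in common]
    if len(diff_a) != n or len(diff_b) != n:   # exactly n differing words
        return None
    return (" ".join(diff_a), " ".join(diff_b))


def find_potential_inconsistencies(names, is_a, n=1):
    """names: dict mapping concept ID -> concept name.
    is_a: set of (child_id, parent_id) pairs, assumed to already include
    any indirect (transitive) subtype links that should count as 'linked'.

    Returns the term-pairs inferred by both a linked and an unlinked
    concept-pair, with the concept-pairs that support each.
    """
    linked, unlinked = {}, {}
    for x, y in combinations(sorted(names), 2):
        if (x, y) in is_a:
            directed, bucket = [(x, y)], linked
        elif (y, x) in is_a:
            directed, bucket = [(y, x)], linked
        else:
            # Assumption: an unlinked pair is considered in both directions.
            directed, bucket = [(x, y), (y, x)], unlinked
        for child, parent in directed:
            term_pair = infer_term_pair(names[child], names[parent], n)
            if term_pair is not None:
                bucket.setdefault(term_pair, []).append((child, parent))
    return {tp: (linked[tp], unlinked[tp]) for tp in linked.keys() & unlinked.keys()}


# Toy terminology (made-up IDs and names, for illustration only).
toy_names = {
    "C1": "bone cancer",
    "C2": "bone neoplasm",
    "C3": "skin cancer",
    "C4": "skin neoplasm",
}
toy_is_a = {("C1", "C2")}  # "bone cancer" is-a "bone neoplasm"

result = find_potential_inconsistencies(toy_names, toy_is_a, n=1)
print(result)
```

On this toy input, the linked pair (C1, C2) and the unlinked pair (C3, C4) both infer the term-pair (cancer, neoplasm), so the sketch flags it as a potential inconsistency. Under the abstract's interpretation, this would suggest either a missing subtype relation between C3 and C4 or an incorrect one between C1 and C2; deciding which still requires expert review.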