Authors:
Chris J. Lu 1; Destinee Tormey 1; Lynn McCreedy 1 and Allen C. Browne 2
Affiliations:
1 National Library of Medicine, Medical Science & Computing and LLC, United States; 2 National Library of Medicine, United States
Keyword(s):
MEDLINE N-Gram Set, Multiwords, Medical Language Processing, Natural Language Processing, the SPECIALIST Lexicon.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Biomedical Engineering; Data Mining; Databases and Information Systems Integration; Enterprise Information Systems; Health Information Systems; Practice-based Research Methods for Healthcare IT; Sensor Networks; Signal Processing; Soft Computing
Abstract:
Multiwords are vital to better Natural Language Processing (NLP) systems: they support more effective and efficient parsers, refine information retrieval searches, and enhance precision and recall in Medical Language Processing (MLP) applications. The Lexical Systems Group has enhanced the coverage of multiwords in the Lexicon to provide a more comprehensive resource for such applications. This paper describes a new systematic approach to lexical multiword acquisition from MEDLINE through filters and matchers based on empirical models. The design goals, functional descriptions, various tests, and applications of the filters, matchers, and data are discussed. Results include: 1) generating a smaller (38%) distilled MEDLINE n-gram set with better precision than, and similar recall to, the MEDLINE n-gram set; 2) establishing a system for generating high-precision multiword candidates for effective Lexicon building. We believe the MLP/NLP community can benefit from access to these big data (MEDLINE n-gram) sets. We also anticipate accelerated growth of multiwords in the Lexicon with this system. Ultimately, improvement in recall or precision can be anticipated in NLP projects using the MEDLINE distilled n-gram set, the SPECIALIST Lexicon, and its applications.