Multi-strategies Integrated Information Extraction for Scholar Profiling Task

J Li, T Zhang, Y Wang, H Gui, X Tan, Z Wang
China Conference on Knowledge Graph and Semantic Computing, 2021, Springer
Abstract
Traditional information extraction tasks built on labeled data sets are convenient for model design and training, but they are also limited by those data sets. In contrast, information extraction aimed directly at web search results is more flexible, more practical, and more challenging. The CCKS-2021 evaluation task “Aminer Scholar Profiling” requires accurate extraction of person attributes within a limited search range. A group of web information extraction methods based on multi-strategy integration is proposed for the task: (1) give priority to extracting attributes from semi-structured web page tags, and fall back to mining unstructured web page text when the tags yield nothing; (2) transform each unstructured attribute extraction task into a text classification task, and construct a separate training data set for each; (3) design a dedicated OCR method to recognize text attributes embedded in images. With these strategies and methods, the accuracy on the validation set and the test set reached 75.68 and 74.84 respectively, and the approach won first place in this evaluation task. Once deep learning algorithms have reached a relatively mature stage on a specific task, exploiting the characteristics of the business and pre-processing the data are more effective than tuning the model.
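Strategy (1) can be illustrated with a minimal Python sketch. The abstract does not say which attributes or rules the authors used, so everything here is an assumption made for illustration: the attribute (an email address), the `mailto:` link as the semi-structured signal, the regular expression, and the names `extract_email` and `_PageScanner` are all hypothetical. The sketch only shows the "tags first, free text as fallback" ordering, not the authors' implementation.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for the fallback step; deliberately simple, not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class _PageScanner(HTMLParser):
    """Collects mailto: links (semi-structured signal) and the page text."""

    def __init__(self):
        super().__init__()
        self.mailto = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop the scheme and any "?subject=..." query part.
                self.mailto.append(href[len("mailto:"):].split("?")[0])

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_email(html):
    """Return (value, source): tags are tried first, free text second."""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    if scanner.mailto:
        return scanner.mailto[0], "tag"
    match = EMAIL_RE.search(" ".join(scanner.text_parts))
    if match:
        return match.group(0), "text"
    return None, "none"
```

Returning the source alongside the value makes it possible to tell which strategy produced each attribute, which is one way to check how often the fallback is actually needed.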