Stop-Word Lists in Keyphrase Extraction: Their Influence and Comparison
Abstract
Keyphrases provide a compact representation of a document's content and are useful in Web search systems, text data mining, and natural language processing applications. The keyphrase extraction domain has been developing for a long time, and achieving further improvements is becoming increasingly challenging. Algorithms compete for minimal gains, which highlights the importance of demonstrating ways to enhance the quality of both existing algorithms and those yet to be developed. This article aims to demonstrate and validate a simple way to enhance keyphrase extraction algorithms: using extended stop-word lists. This approach improves keyphrase extraction algorithms by 4% or more on average. Nevertheless, no studies have compared different stop-word lists and their impact on the domain. Our goal is to close this gap. We compared the impact of both existing extended and standard stop-word lists on the performance of 10 unsupervised keyphrase extraction algorithms across 5 datasets (a total of 10 sub-datasets were used). We aim to show that research into methods for constructing and using extended stop-word lists deserves attention and could become one of the subdirections of the keyphrase extraction domain. When a suitable list is selected, extended stop words consistently enhance the performance of algorithms, and the gains are statistically significant. Based on the obtained results, we can assume that knowing the type of text from which keyphrases need to be extracted allows us to select the most appropriate stop-word list.
Keywords
Keyphrase extraction, stop words, NLP