Most significant substring mining based on chi-square measure

S Dutta, A Bhattacharya. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2010. Springer.
Abstract
Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases, such as those from intrusion detection systems, player statistics, texts, and proteins, has emerged as a great challenge. Searching for an unusual pattern within long strings of data is now a requirement for diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (i.e., less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96.
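To make the problem statement concrete, the following is a minimal sketch of an exhaustive baseline, not the two heuristics the abstract mentions (which are not described here). It assumes the chi-square measure of a substring is the usual Pearson statistic against a fixed multinomial null model, that is, the sum over alphabet symbols of (observed - expected)^2 / expected, with expected = length × probability. The names `chi_square`, `top_k_substrings`, and `probs` are hypothetical and chosen for illustration only.

```python
import heapq
from collections import Counter


def chi_square(substring, probs):
    """Pearson chi-square of `substring` against the null model `probs`.

    Assumption: `probs` maps every alphabet symbol to its expected
    probability (all values > 0, summing to 1).
    """
    n = len(substring)
    counts = Counter(substring)
    return sum((counts.get(ch, 0) - n * p) ** 2 / (n * p)
               for ch, p in probs.items())


def top_k_substrings(text, probs, k):
    """Exhaustive baseline: score every substring and keep the k largest.

    Returns (score, start, end) tuples with `end` exclusive. This examines
    O(n^2) substrings, which is the cost a faster heuristic would try to avoid.
    Every character of `text` must be a key of `probs`.
    """
    scored = []
    n = len(text)
    for start in range(n):
        counts = dict.fromkeys(probs, 0)  # running counts for text[start:end]
        for end in range(start + 1, n + 1):
            counts[text[end - 1]] += 1
            length = end - start
            score = sum((counts[ch] - length * p) ** 2 / (length * p)
                        for ch, p in probs.items())
            scored.append((score, start, end))
    return heapq.nlargest(k, scored)
```

For example, under a uniform model over the alphabet {a, b}, the run "aaaa" in the string "abaaaab" scores highest, because its symbol counts deviate most from the expected even split relative to its length.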