
Most significant substring mining based on chi-square measure

S Dutta, A Bhattacharya - … -Asia Conference on Knowledge Discovery and …, 2010 - Springer

Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2010 • Springer

Abstract

Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for an unusual pattern within long strings of data is a requirement for diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96.
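To make the problem statement concrete, the following is a minimal brute-force sketch, not the two heuristics the paper proposes (the abstract does not describe them). It assumes the standard Pearson chi-square statistic under a null model in which each symbol is drawn independently with a fixed probability; the function names `chi_square` and `top_k_substrings`, the `probs` mapping, and the sample string are hypothetical and chosen only for illustration. Every symbol in the text is assumed to appear in `probs` with a positive probability.

```python
import heapq
from collections import Counter


def chi_square(counts, length, probs):
    """Pearson chi-square of observed symbol counts against the expected
    counts length * p under an assumed independent-symbol null model."""
    return sum(
        (counts.get(sym, 0) - length * p) ** 2 / (length * p)
        for sym, p in probs.items()
    )


def top_k_substrings(text, probs, k):
    """Score every substring exhaustively and return the k highest-scoring
    ones as (score, start, end) tuples, with end exclusive.

    This is a quadratic-enumeration baseline for illustration only.
    """
    heap = []  # min-heap holding the best k candidates seen so far
    n = len(text)
    for start in range(n):
        counts = Counter()
        for end in range(start, n):
            counts[text[end]] += 1  # extend the substring by one symbol
            score = chi_square(counts, end - start + 1, probs)
            item = (score, start, end + 1)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    # Hypothetical input: a two-symbol string with uniform null probabilities.
    sample = "abababaaaaab"
    uniform = {"a": 0.5, "b": 0.5}
    for score, start, end in top_k_substrings(sample, uniform, 3):
        print(f"{score:.3f}  [{start}:{end}]  {sample[start:end]}")
```

On the sample string, the highest-scoring substring is the run of five consecutive `a` symbols, which deviates most from the assumed uniform model. The exhaustive enumeration is what makes faster heuristics attractive on long strings.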


