A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm

Mohammadi, Hamid; Khasteh, Seyed Hossein

arXiv:1810.03102 (cs)

[Submitted on 7 Oct 2018 (v1), last revised 24 Sep 2019 (this version, v3)]

Abstract: A concise, duplicate-free index is one of the key factors that make a search engine fast and accurate. To remove duplicate and near-duplicate documents from the index, a search engine needs a fast and reliable detection system for duplicate and near-duplicate text documents. Traditional approaches to this problem, such as brute-force comparison or simple hash-based algorithms, are unsuitable because they do not scale and cannot detect near-duplicate documents effectively. This paper introduces a new signature-based approach to text similarity detection that is fast, scalable, and reliable, and that needs less storage space. The proposed method is evaluated on popular text document datasets, including CiteseerX, Enron, and the Gold Set of Near-duplicate News Articles. The results are promising and comparable with the best state-of-the-art algorithms in both accuracy and performance. The method is based on using reference texts to generate signatures for text documents. The novelty of this paper is the use of genetic algorithms to generate better reference texts.
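The abstract names the idea (signatures built from reference texts, compared with a multi-reference cosine) but does not give the details. The following is only a minimal sketch of how such a signature could work, not the authors' implementation. It assumes that a signature is the tuple of cosine similarities between a document's term-frequency vector and each reference text. The whitespace tokenizer, the function names, and the `tolerance` decision rule are all hypothetical, and the genetic-algorithm step that selects reference texts is not shown.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def signature(text, references):
    """Assumed signature: one cosine value per reference text."""
    vec = term_vector(text)
    return tuple(cosine(vec, term_vector(ref)) for ref in references)


def is_near_duplicate(sig_a, sig_b, tolerance=0.05):
    """Hypothetical rule: every signature component agrees within tolerance."""
    return all(abs(x - y) <= tolerance for x, y in zip(sig_a, sig_b))
```

Under these assumptions, each document is reduced to a few floating-point numbers, so candidate pairs can be compared without re-reading the full texts. Two unrelated documents can still produce close signatures, so a real system would likely confirm candidates with a full comparison.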

Comments:	8 pages, 2 figures, 5 tables, 4 equations
Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:1810.03102 [cs.IR]
	(or arXiv:1810.03102v3 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1810.03102

Submission history

From: Hamid Mohammadi
[v1] Sun, 7 Oct 2018 08:17:07 UTC (141 KB)
[v2] Tue, 16 Oct 2018 13:08:22 UTC (290 KB)
[v3] Tue, 24 Sep 2019 18:11:47 UTC (397 KB)
