One-Pass, One-Hash n-Gram Statistics Estimation

Lemire, Daniel; Kaser, Owen

arXiv:cs/0610010 (cs)

[Submitted on 3 Oct 2006 (v1), last revised 4 Feb 2014 (this version, v4)]

Abstract: In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. To reduce costs further, we investigate recursive random hashing algorithms and show that they are sufficiently independent in practice. We compare our running times with exact counts using suffix arrays and show that, while we use hardly any storage, we are an order of magnitude faster. The approach is further extended to a one-pass/one-hash computation of n-gram entropy and iceberg counts. The experiments use a large collection of English text from the Gutenberg Project as well as synthetic data.
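
To make the idea of a one-pass, one-hash estimate concrete, the following is a minimal Python sketch. The abstract does not specify the estimator, so this sketch substitutes a k-minimum-values (KMV) estimator as a stand-in; it is not the authors' algorithm. The names `hash64` and `estimate_distinct_ngrams`, the choice of BLAKE2b as the hash, and the default `k=256` are all assumptions made for illustration. Each n-gram is hashed exactly once during a single scan, and only the k smallest hash values are kept in memory.

```python
import hashlib
import heapq
from collections import deque


def hash64(ngram):
    """Map an n-gram (tuple of strings) to a 64-bit integer.

    BLAKE2b is an assumption chosen for convenience, not the
    hash family studied in the paper.
    """
    data = "\x1f".join(ngram).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def estimate_distinct_ngrams(symbols, n, k=256):
    """Estimate the number of distinct n-grams in one pass with one hash.

    Keeps only the k smallest distinct hash values (KMV sketch).
    If fewer than k distinct hashes are seen, the count is returned as is.
    """
    window = deque(maxlen=n)   # sliding window over the last n symbols
    heap = []                  # negated hashes: a max-heap of the k smallest
    kept = set()               # hashes currently stored in the heap
    for symbol in symbols:
        window.append(symbol)
        if len(window) < n:
            continue
        h = hash64(tuple(window))
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest stored hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))
    kth_smallest = -heap[0]
    # The k-th smallest of D uniform hashes sits near k / D of the hash range.
    return (k - 1) * 2.0**64 / (kth_smallest + 1)
```

Calling `estimate_distinct_ngrams("abracadabra", 2)` scans the character bigrams of the string; because the input has fewer than `k` distinct bigrams, the sketch returns their count directly. For longer inputs the relative error of a KMV estimate shrinks roughly as one over the square root of `k`, so memory use is governed by `k` and not by the size of the data.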

Comments:	Fixed a typo
Subjects:	Databases (cs.DB); Computation and Language (cs.CL)
Report number:	TR-06-001
Cite as:	arXiv:cs/0610010 [cs.DB]
	(or arXiv:cs/0610010v4 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.cs/0610010

Submission history

From: Daniel Lemire
[v1] Tue, 3 Oct 2006 18:04:22 UTC (110 KB)
[v2] Wed, 11 Oct 2006 18:58:03 UTC (110 KB)
[v3] Tue, 17 Oct 2006 18:05:22 UTC (110 KB)
[v4] Tue, 4 Feb 2014 16:15:57 UTC (111 KB)
