Bidirectional string anchors (bd-anchors) is a new string sampling mechanism. Given a positive integer ℓ, the mechanism selects, in every sliding window of length ℓ of the input text, the starting position of the leftmost lexicographically smallest rotation of that window.
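The definition above can be illustrated with a deliberately naive sketch. It is not the code shipped in bd-construct: the function names `least_rotation_start` and `bd_anchors` are hypothetical, positions are 0-based, and each window is handled by brute-force comparison of all its rotations, which is far slower than the O(nℓ) construction.

```python
def least_rotation_start(window: str) -> int:
    """Return the leftmost start index of the lexicographically smallest rotation."""
    m = len(window)
    doubled = window + window  # every rotation is a length-m substring of this
    best = 0
    for j in range(1, m):
        # Strict '<' keeps the leftmost index when several rotations tie.
        if doubled[j:j + m] < doubled[best:best + m]:
            best = j
    return best


def bd_anchors(text: str, ell: int) -> list[int]:
    """Return the sorted 0-based positions selected over all windows of length ell."""
    anchors = set()
    for i in range(len(text) - ell + 1):
        anchors.add(i + least_rotation_start(text[i:i + ell]))
    return sorted(anchors)


print(bd_anchors("aabaaabcbda", 5))  # prints [3, 4, 5, 10]
```

In this small example the first four windows all select position 3, which is why seven windows yield only four anchors.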
Bd-anchors samples are approximately uniform, locally consistent, and computable in O(n) time for any input text of length n and any ℓ; our current implementation, however, uses an O(nℓ)-time construction.
Our experiments using several datasets show that the bd-anchors sample sizes decrease proportionally to ℓ, and that these sizes are competitive with or smaller than the minimizers sample sizes obtained with the analogous sampling parameters. For instance, for Chromosome 1 of the human genome, which is of length n = 230,481,390, and ℓ = 500 (resp. 1000), the set A of order-ℓ bd-anchors is of size 1,560,882 (resp. 897,953).
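To put the Chromosome 1 figures in perspective, the short calculation below derives the average spacing n/|A| between consecutive anchors. It uses only the numbers reported above; the variable names are illustrative.

```python
n = 230_481_390  # length of Chromosome 1 as reported above

# Reported sizes of the bd-anchors set A for each value of ell.
sample_sizes = {500: 1_560_882, 1000: 897_953}

# Average number of text positions per selected anchor.
spacing = {ell: n / size for ell, size in sample_sizes.items()}

for ell, value in spacing.items():
    print(f"ell = {ell}: about one anchor every {value:.1f} positions")
```

This works out to roughly one anchor per 148 positions for ℓ = 500 and one per 257 positions for ℓ = 1000.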
Constructing the Sample: Our current implementation takes O(nℓ) time. To compile the program, change to directory bd-construct and follow the instructions given in file INSTALL.
We apply bd-anchors to two problems:
Text Indexing: Our index has size n bytes + O(|A|) integers and supports locate operations for any pattern of length at least ℓ. The bd-index-grid implementation answers these queries in near-optimal time; the bd-index implementation has no such worst-case time bound, but it is considerably faster in practice, especially when the number of occurrences is high. To compile the program, change to directory bd-index or bd-index-grid and follow the instructions given in file INSTALL.
Top-K Similarity Search under Edit Distance: To compile the program, change to directory bd-search and follow the instructions given in file INSTALL.
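For the text indexing application, the requirement that patterns have length at least ℓ follows from local consistency: the first ℓ letters of the pattern select the same relative position in every occurrence, so each occurrence must be aligned with some anchor of the text. The sketch below illustrates only this principle, assuming 0-based positions and brute-force helpers. The names `locate`, `bd_anchors`, and `least_rotation_start` are hypothetical, and the linear scan over A stands in for the actual data structures of bd-index and bd-index-grid.

```python
def least_rotation_start(window: str) -> int:
    """Leftmost start index of the lexicographically smallest rotation (brute force)."""
    m = len(window)
    doubled = window + window
    best = 0
    for j in range(1, m):
        if doubled[j:j + m] < doubled[best:best + m]:
            best = j
    return best


def bd_anchors(text: str, ell: int) -> set[int]:
    """Set of 0-based positions selected over all windows of length ell."""
    return {i + least_rotation_start(text[i:i + ell])
            for i in range(len(text) - ell + 1)}


def locate(text: str, anchors: set[int], ell: int, pattern: str) -> list[int]:
    """Report all occurrences of pattern by checking only anchor-aligned candidates."""
    if len(pattern) < ell:
        raise ValueError("pattern must have length at least ell")
    # Relative position selected by the first window of the pattern.
    offset = least_rotation_start(pattern[:ell])
    hits = []
    for a in sorted(anchors):
        start = a - offset  # candidate occurrence aligned with this anchor
        if start >= 0 and text.startswith(pattern, start):
            hits.append(start)
    return hits


text = "aabaaabcbda"
A = bd_anchors(text, 5)
print(locate(text, A, 5, "aaabcb"))  # prints [3]
```

Only |A| candidate positions are verified instead of all n, which is where a small sample pays off; a real index would replace the scan over A with a search structure built on the anchors.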
When publishing work that is based on the results from bd-anchors please cite:
G. Loukides, S. P. Pissis, M. Sweering:
Bidirectional String Anchors for Improved Text Indexing and Top-K Similarity Search.
IEEE Trans. Knowl. Data Eng. DOI: 10.1109/TKDE.2022.3231780
G. Loukides and S. P. Pissis:
Bidirectional String Anchors: a New String Sampling Mechanism.
ESA 2021: 64:1-64:21. DOI: 10.4230/LIPIcs.ESA.2021.64
License: GNU GPLv3 License; Copyright (C) 2021 Grigorios Loukides and Solon P. Pissis.