A formal perspective on byte-pair encoding
arXiv preprint arXiv:2306.16837, 2023
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1 - e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(NM)$ to $\mathcal{O}(N \log M)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
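To make the "iterative greedy version" concrete, the following is a minimal Python sketch of greedy BPE training on a single sequence: at each of $M$ steps it counts adjacent symbol pairs, merges the most frequent pair, and records the merge. The function name `greedy_bpe` and the toy input are illustrative assumptions, not the paper's code; the per-step full rescan corresponds to the classical $\mathcal{O}(NM)$ behaviour, not the paper's faster $\mathcal{O}(N \log M)$ implementation.

```python
from collections import Counter

def greedy_bpe(sequence, num_merges):
    """Greedily learn up to `num_merges` BPE merges on one symbol sequence.

    Sketch of vanilla greedy BPE training: at each step, count all adjacent
    symbol pairs, merge the most frequent pair into a single new symbol, and
    record it. This is the greedy procedure whose approximation quality the
    paper analyses via submodular functions.
    """
    seq = list(sequence)
    merges = []
    for _ in range(num_merges):
        # Count occurrences of every adjacent pair in the current sequence.
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, _ = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        # Rewrite the sequence, replacing each left-to-right occurrence of
        # the best pair with a single merged symbol.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best_pair:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

# Example: learn 3 merges on a toy character sequence.
merges, tokens = greedy_bpe("abababcabc", 3)
print(merges)   # [('a', 'b'), ('ab', 'ab'), ('ab', 'c')]
print(tokens)   # ['abab', 'abc', 'abc']
```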