Tokenization and the noiseless channel

V Zouhar, C Meister, JL Gastaldi, L Du… - arXiv preprint arXiv …, 2023 - arxiv.org
… In this case, one can equivalently think of noiseless channel encoding as compression. Thus…
tokenization functions as if our goal is to use them to communicate over a noiseless channel

Two Counterexamples to Tokenization and the Noiseless Channel

M Cognetta, V Zouhar, S Moon, N Okazaki - arXiv preprint arXiv …, 2024 - arxiv.org
… for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the … tokenization
scheme that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization

Two Counterexamples to Tokenization and the Noiseless Channel

M Cognetta, V Zouhar, S Moon… - Proceedings of the 2024 …, 2024 - aclanthology.org
… for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the … tokenization
scheme that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization

Tokenization Is More Than Compression

CW Schmidt, V Reddy, H Zhang, A Alameddine… - arXiv preprint arXiv …, 2024 - arxiv.org
… tokenization. To examine which other factors play a role, we evaluate design decisions across
all three phases of tokenization: pre-tokenization… of tokenization, we consider tokenization

Toward a Theory of Tokenization in LLMs

N Rajaraman, J Jiao, K Ramchandran - arXiv preprint arXiv:2404.08335, 2024 - arxiv.org
… With the addition of tokenization, however, we empirically observe that transformers break …
by transformers with and without tokenization. With the appropriate tokenization, we show that …

An Analysis of Tokenization: Transformers under Markov Data

N Rajaraman, J Jiao, K Ramchandran - The Thirty-eighth Annual … - openreview.net
… With the addition of tokenization, however, we empirically observe that transformers break …
by transformers with and without tokenization. With the appropriate tokenization, we show that …

The foundations of tokenization: Statistical and computational concerns

JL Gastaldi, J Terilla, L Malagutti, B DuSell… - arXiv preprint arXiv …, 2024 - arxiv.org
… for representing and analyzing tokenization models and establish various results for the
use of tokenizers, including the necessary and sufficient conditions for a tokenizer model to …

Understanding and Mitigating Tokenization Bias in Language Models

B Phan, M Havasi, M Muckley, K Ullrich - arXiv preprint arXiv:2406.16829, 2024 - arxiv.org
… tokenization and language models setup in our paper. We then describe the next-character
sampling bias problem due to tokenization… constructed using any tokenization algorithm such …

Semantic Text Transmission via Prediction with Small Language Models: Cost-Similarity Trade-off

BA Madhabhavi, G Karevvanavar, RV Bhat… - arXiv preprint arXiv …, 2024 - arxiv.org
… a destination over noiseless and character-erasure channels. We … occurs over a noiseless
channel, the threshold policy … embedding vector for the jth tokenized word of lth input sequence …

Greed is all you need: An evaluation of tokenizer inference methods

O Uzan, CW Schmidt, C Tanner, Y Pinter - arXiv preprint arXiv:2403.01289, 2024 - arxiv.org
… tokenization … tokenizer, that alignment of word segments to morphological gold-standard
segmentations is a predictor of the ability of a language model that uses the given tokenizer to …