Chatterbox: Robust Transport for LLM Token Streaming under Unstable Network

H. Li, Y. Liu, Y. Cheng, S. Ray, K. Du, J. Jiang. arXiv preprint arXiv:2401.12961, 2024.
To render each generated token in real time, the LLM server generates response tokens one by one and streams each generated token (or group of a few tokens) through the network to the user right after it is generated, which we refer to as LLM token streaming. However, under unstable network conditions, the LLM token streaming experience can suffer greatly from stalls, since a single packet loss can block the rendering of tokens contained in subsequent packets even if those packets arrive on time. With a real-world measurement study, we show that current applications, including ChatGPT, Claude, and Bard, all suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM chatbots, we propose a novel transport-layer scheme, called Chatterbox, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and can be independently rendered when received, thus avoiding the aforementioned stalls caused by missing packets. Through simulation under various network conditions, we show that Chatterbox reduces the stall ratio (the proportion of time spent waiting for token rendering) by 71.0% compared to the token streaming method commonly used by real chatbot applications and by 31.6% compared to a custom packet duplication scheme. By tailoring Chatterbox to the token-by-token generation of LLMs, we enable chatbots to respond like an eloquent speaker, letting users better enjoy pervasive AI.
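To illustrate the packetization idea the abstract describes, here is a minimal sketch (not the authors' implementation; the class and field names are hypothetical) of a sender that places all currently unacknowledged tokens plus the newly generated ones into each outgoing packet, so that any single received packet is independently renderable up to the newest token:

```python
# Minimal sketch of the Chatterbox packetization idea, as described in the
# abstract. All names (Packet, ChatterboxSender, etc.) are illustrative
# assumptions, not the paper's actual API.

from dataclasses import dataclass, field


@dataclass
class Packet:
    seq: int              # packet sequence number
    first_token_idx: int  # stream index of the first token carried
    tokens: list[str]     # all unacked tokens plus the new ones


@dataclass
class ChatterboxSender:
    unacked: list[str] = field(default_factory=list)  # sent but not yet acked
    base_idx: int = 0     # stream index of unacked[0]
    next_seq: int = 0

    def send(self, new_tokens: list[str]) -> Packet:
        """Build the next outgoing packet: every unacked token + new tokens."""
        self.unacked.extend(new_tokens)
        pkt = Packet(self.next_seq, self.base_idx, list(self.unacked))
        self.next_seq += 1
        return pkt

    def on_ack(self, acked_through_idx: int) -> None:
        """Receiver confirmed rendering of tokens below acked_through_idx."""
        drop = max(0, acked_through_idx - self.base_idx)
        self.unacked = self.unacked[drop:]
        self.base_idx += drop


sender = ChatterboxSender()
p0 = sender.send(["Hello", ","])  # carries tokens 0-1
p1 = sender.send([" world"])      # carries tokens 0-2, since p0 is unacked
# Even if p0 is lost, p1 alone lets the receiver render all three tokens,
# avoiding the head-of-line stall a lost packet would otherwise cause.
```

Unlike blind packet duplication, this redundancy shrinks as acknowledgments arrive: once the receiver acks a prefix of the token stream, those tokens are dropped from future packets, which is consistent with the abstract's claim of outperforming a custom duplication scheme.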