User profiles for Woosuk Kwon
Woosuk Kwon, PhD student, UC Berkeley. Verified email at berkeley.edu. Cited by 1125
Efficient memory management for large language model serving with PagedAttention
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache (KV …
Feedback stabilization of linear systems with delayed control
W Kwon, A Pearson - IEEE Transactions on Automatic Control, 1980 - ieeexplore.ieee.org
Feedback controls based on the receding horizon method have proven to be a useful and
easy tool in stabilizing linear ordinary differential systems. In this paper the receding horizon …
A fast post-training pruning framework for transformers
Pruning is an effective way to reduce the huge inference cost of Transformer models. However,
prior work on pruning Transformers requires retraining the models. This can add high …
Graphene: Strong yet lightweight row hammer protection
Row Hammer is a serious security threat to modern computing systems using DRAM as main
memory. It causes charge loss in DRAM cells adjacent to a frequently activated aggressor …
Nimble: Lightweight and parallel GPU task scheduling for deep learning
Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference
and training. Ideally, DL frameworks should be able to fully utilize the computation power …
SkyPilot: An intercloud broker for sky computing
To comply with the increasing number of government regulations about data placement and
processing, and to protect themselves against major cloud outages, many users want the …
Learned token pruning for transformers
Efficient deployment of transformer models in practice is challenging due to their inference
cost including memory footprint, latency, and power consumption, which scales quadratically …
The ATSC link-layer protocol (ALP): Design and efficiency evaluation
W Kwon, J Hwang, HK Yang, S Hwang… - IEEE Transactions …, 2016 - ieeexplore.ieee.org
In this paper, a novel data link layer protocol adopted in the ATSC 3.0 terrestrial broadcast
standard is described. The data link layer is a protocol layer between the physical layer and …
Risk factors for early septic failure after two-stage exchange total knee arthroplasty for treatment of periprosthetic joint infection
…, KK Park, BW Cho, JY Park, I Kim, HM Kwon - Journal of Orthopaedics …, 2024 - Springer
Background The cause of early septic failure after two-stage exchange revision total knee
arthroplasty (TKA) for chronic periprosthetic joint infection (PJI) and the factors affecting it are …
Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
Reducing the inference latency of large language models (LLMs) is crucial, and speculative
decoding (SD) stands out as one of the most effective techniques. Rather than letting the …