FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping

Yu, Minchen; Wang, Ao; Chen, Dong; Yu, Haoxuan; Luo, Xiaonan; Li, Zhuohao; Wang, Wei; Chen, Ruichuan; Nie, Dapeng; Yang, Haoran

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2306.03622 (cs)

[Submitted on 6 Jun 2023 (v1), last revised 8 Feb 2024 (this version, v2)]

Abstract: Serverless computing has become increasingly popular for machine learning inference. However, current serverless platforms lack efficient support for GPUs, limiting their ability to deliver low-latency inference. In this paper, we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap employs a holistic approach to system and algorithm design. It maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding), thereby enabling a large number of inference functions to efficiently share a node's GPUs. FaaSwap uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to achieve optimal performance. We also develop an interference-aware request scheduling algorithm that allows FaaSwap to meet the latency SLOs for individual inference functions. We have implemented FaaSwap as a prototype on a leading commercial serverless platform. Experimental evaluations demonstrate that, with model swapping, FaaSwap can concurrently serve hundreds of functions on a single worker node with 4 V100 GPUs, while achieving inference performance comparable to native execution (where each function runs on a dedicated GPU). When deployed on a 6-node production testbed, FaaSwap meets the latency SLOs for over 1k functions, the maximum that the testbed can handle concurrently.
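
The late-binding idea in the abstract (models stay in main memory and are swapped onto a GPU only when a request arrives) can be illustrated with a small simulation. This is a minimal sketch and not FaaSwap's implementation: the class name `GpuModelCache`, the model sizes, and the least-recently-used eviction policy are all assumptions made for illustration, and no real GPU API is called.

```python
from collections import OrderedDict


class GpuModelCache:
    """Toy simulation of late binding: every model stays in host memory,
    and a model is placed on the GPU only when a request for it arrives.
    LRU eviction is an assumption here, not the paper's policy."""

    def __init__(self, gpu_capacity_mb, host_models):
        self.capacity = gpu_capacity_mb
        self.host_models = dict(host_models)  # name -> size in MB (host copy)
        self.resident = OrderedDict()         # GPU-resident models, oldest first
        self.used = 0                         # GPU memory in use, in MB

    def serve(self, name):
        """Handle one request; return ("hit" or "swap", list of evicted models)."""
        size = self.host_models[name]
        if size > self.capacity:
            raise ValueError(f"{name} does not fit on the GPU")
        if name in self.resident:
            # Model is already on the GPU: mark it as most recently used.
            self.resident.move_to_end(name)
            return "hit", []
        evicted = []
        # Evict least recently used models until the new one fits.
        while self.used + size > self.capacity:
            victim, victim_size = self.resident.popitem(last=False)
            self.used -= victim_size
            evicted.append(victim)
        # "Swap in": in a real system this would be a host-to-GPU copy.
        self.resident[name] = size
        self.used += size
        return "swap", evicted


# Hypothetical workload: three models sharing one 1000 MB GPU.
cache = GpuModelCache(1000, {"a": 600, "b": 300, "c": 500})
for request in ["a", "b", "a", "c"]:
    print(request, cache.serve(request))
```

Because evicted models remain in host memory in this sketch, eviction only drops the GPU copy, and a later request for the same model pays the swap-in cost again rather than a full cold start.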

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2306.03622 [cs.DC]
	(or arXiv:2306.03622v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2306.03622

Submission history

From: Minchen Yu
[v1] Tue, 6 Jun 2023 12:19:05 UTC (500 KB)
[v2] Thu, 8 Feb 2024 12:34:16 UTC (357 KB)
