Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

Choi, Seungbeom; Lee, Sunho; Kim, Yeonjae; Park, Jongse; Kwon, Youngjin; Huh, Jaehyuk

Abstract:As machine learning techniques are applied to a widening range of applications, high throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges: first, they must provide a bounded latency for each request to support consistent service-level objective (SLO), and second, they can serve multiple heterogeneous ML models in a system as certain tasks involve invocation of multiple models and consolidating multiple models can improve system utilization. To address the two requirements of ML inference servers, this paper proposes a new ML inference scheduling framework for multi-model ML inference servers. The paper first shows that with SLO constraints, current GPUs are not fully utilized for ML inference tasks. To maximize the resource efficiency of inference servers, a key mechanism proposed in this paper is to exploit hardware support for spatial partitioning of GPU resources. With the partitioning mechanism, a new abstraction layer of GPU resources is created with configurable GPU resources. The scheduler assigns requests to virtual GPUs, called gpu-lets, with the most effective amount of resources. The paper also investigates a remedy for potential interference effects when two ML tasks are running concurrently in a GPU. Our prototype implementation proves that spatial partitioning enhances throughput by 102.6% on average while satisfying SLOs.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2109.01611 [cs.DC]
	(or arXiv:2109.01611v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2109.01611

Computer Science > Distributed, Parallel, and Cluster Computing

Title:Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators