The proliferation of high-resolution cameras on embedded devices, along with the growing maturity of deep neural networks (DNNs), has enabled powerful mobile vision applications. To support these applications, resource-constrained mobile devices can be extended with server-class processors and GPU accelerators via mobile offloading techniques. However, offloading is challenging for latency-constrained applications because of the large and unpredictable round-trip delays from mobile devices to cloud computing resources. As a consequence, system designers routinely look for ways to offload to local servers at the network edge, known as cloudlets.
This dissertation explores the potential of building emerging continuous mobile vision applications on the cloudlet. We identify two challenges in implementing a multi-tenant, real-time mobile offloading framework. First, an application may exploit DNNs for diverse tasks, e.g., object classification, detection, and tracking, so we need methods to reduce delay and improve throughput in DNN inference. Second, when the system hosts multiple clients and various applications concurrently, contention for computing resources, e.g., CPUs and GPUs, leads to longer delays; scheduling techniques that mitigate resource contention are thus essential for a low-latency serving system.
To address these challenges, we design new APIs, systems, and schedulers that enable a high-performance mobile offloading framework. The framework manages data as in-memory key-value pairs that can be cached on servers to avoid redundant data transfers, and an application is programmed as a Directed Acyclic Graph (DAG) of queries that process mobile data. We investigate infrastructure optimizations, such as batching and parallelization of DNN inference, for a variety of vision tasks, e.g., object detection, tracking, scene graph detection, and video description. To improve resource utilization, our system co-locates delay-critical and delay-tolerant workloads on shared GPUs using a new predictive approach. We also study adaptive batching algorithms that consider both the accuracy and delay of DNN inference on a heterogeneous cluster of GPUs. For CPU workloads, we present methods that mitigate resource contention by predicting future workloads, estimating contention, and adjusting task start times to avoid it. We demonstrate the effectiveness of the framework through evaluations with several real-world applications.
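To give a flavor of the programming model, the following is a minimal, hypothetical sketch of a key-value cache combined with a DAG of queries; the names (OffloadGraph, Query, put, run) and the toy in-process execution are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of the offloading programming model: mobile data is
# cached server-side as key-value pairs, and an application is a DAG of
# queries over that data. Names and structure are illustrative only.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Query:
    """One node in the application DAG: a named operation over cached data."""
    name: str
    fn: Callable[..., Any]
    deps: List["Query"] = field(default_factory=list)


class OffloadGraph:
    """In-memory key-value cache plus a DAG executor (toy, single-process)."""

    def __init__(self) -> None:
        # Cached key-value pairs; a re-used key avoids a redundant transfer.
        self.cache: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Upload a mobile data item (e.g., a video frame) unless already cached."""
        self.cache.setdefault(key, value)

    def run(self, query: Query) -> Any:
        """Execute a query after its dependencies, passing their outputs along."""
        inputs = [self.run(dep) for dep in query.deps]
        return query.fn(self.cache, *inputs)


# Example pipeline: detect objects in a cached frame, then track them.
graph = OffloadGraph()
graph.put("frame:42", "<jpeg bytes>")  # later requests for the same key skip the upload

detect = Query("detect", lambda cache: f"boxes from {cache['frame:42']}")
track = Query("track", lambda cache, boxes: f"tracks from {boxes}", deps=[detect])

print(graph.run(track))
```

In the real system, the query functions would invoke DNN inference on the server (where batching, parallelization, and GPU co-location apply), rather than the placeholder lambdas used here.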