Adaptive heterogeneous scheduling for integrated GPUs

R Kaleem, R Barik, T Shpeisman, BT Lewis… - Proceedings of the 23rd …, 2014 - dl.acm.org
Many processors today integrate a CPU and GPU on the same die, which allows them to share resources like physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It profiles on the CPU and GPU in a way that does not penalize GPU-centric workloads, that is, workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance arising from irregularity caused by, e.g., data-dependent control flow, 2) different amounts of work on each kernel call, and 3) multiple kernels with different characteristics. Unlike many existing approaches, which primarily target NVIDIA discrete GPUs, our scheduling algorithm does not require offline processing.
We evaluate our asymmetric scheduling algorithm on a desktop system with a 4th-generation Intel Core processor using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput achieved by a perfect CPU-and-GPU oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
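The general idea of partitioning work from online profiling can be illustrated with a minimal sketch. This is a generic proportional split under stated assumptions, not the paper's naïve or asymmetric algorithm, whose details the abstract does not give. The names `split_work` and `schedule`, and the `cpu_time` / `gpu_time` callbacks (which return the seconds a device needs to process a given number of items), are hypothetical.

```python
from typing import Callable, Tuple


def split_work(total_items: int, cpu_rate: float, gpu_rate: float) -> Tuple[int, int]:
    """Split items between CPU and GPU in proportion to measured throughput.

    Rates are in items per second. Returns (cpu_items, gpu_items).
    """
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("throughput rates must be positive")
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    gpu_items = round(total_items * gpu_share)
    return total_items - gpu_items, gpu_items


def schedule(
    total_items: int,
    profile_size: int,
    cpu_time: Callable[[int], float],  # hypothetical: seconds for n items on the CPU
    gpu_time: Callable[[int], float],  # hypothetical: seconds for n items on the GPU
) -> Tuple[int, int]:
    """Profile one small chunk per device, then split the remaining items."""
    # Each device processes one profiling chunk, so those items are already done.
    remaining = total_items - 2 * profile_size
    if remaining < 0:
        raise ValueError("profiling chunks exceed the total work")
    cpu_rate = profile_size / cpu_time(profile_size)
    gpu_rate = profile_size / gpu_time(profile_size)
    return split_work(remaining, cpu_rate, gpu_rate)


if __name__ == "__main__":
    # Simulated devices: the GPU is three times faster than the CPU.
    cpu_items, gpu_items = schedule(
        total_items=1040,
        profile_size=20,
        cpu_time=lambda n: n / 100.0,
        gpu_time=lambda n: n / 300.0,
    )
    print(cpu_items, gpu_items)
```

A symmetric split like this one profiles both devices with equal-sized chunks, which is precisely the cost the abstract says the asymmetric algorithm avoids for GPU-centric workloads; a real scheduler would also re-profile across kernel calls to handle varying work sizes and irregular kernels.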