It is known that with the support of domain-specific customizable heterogeneous architectures, energy efficiency can be improved significantly by adapting the architecture to match the requirements of a given application or application domain. One of the main challenges in this emerging trend is how to efficiently take advantage of the heterogeneity and customization features of these architectures. This research investigates efficient compiler support that automates the platform mapping and code transformation process.
First, for customizable computing engines, we have investigated both tightly-coupled and loosely-coupled computing elements. For tightly-coupled computing engine customization, we explore customizable vector ISA support to better exploit data-level parallelism in high-performance applications. We identify the needs and opportunities for customized vector instructions and quantify their benefits. We build an automatic compilation flow on the LLVM-2.7 compiler infrastructure to efficiently identify customized vector instructions from a given set of applications. The memory alignment overhead, which is known to be critical for vector processing efficiency, is optimized in our customized vector ISA identification flow. To support efficient vector ISA customization, we design a composable vector unit (CVU) that can be used both separately and in a chained mode, supporting a large number of virtualized custom vector instructions with minimal area overhead. The results show that our approach achieves an average 27% speedup over the state-of-the-art vector ISA.
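To make the identification step concrete, the sketch below shows one simplified way recurring operation chains in a kernel's dataflow graph could be enumerated and ranked by frequency; frequent chains are natural candidates for fusion into a single custom vector instruction. This is a minimal illustration under assumed data structures (the `DFG` dictionary, the two-operation pattern window), not the actual LLVM-based flow described above.

```python
# Hypothetical sketch: rank two-operation chains (producer -> consumer) by
# frequency in a vectorized kernel's dataflow graph.  Frequent chains are
# candidates for fusion into a single custom vector instruction.
from collections import Counter
from typing import Dict, List, Tuple

# Dataflow graph: node id -> (opcode, list of operand node ids).
DFG = Dict[int, Tuple[str, List[int]]]

def chain_candidates(dfg: DFG) -> Counter:
    """Count every (producer_opcode, consumer_opcode) pair in the graph."""
    counts: Counter = Counter()
    for _, (opcode, operands) in dfg.items():
        for src in operands:
            if src in dfg:
                counts[(dfg[src][0], opcode)] += 1
    return counts

if __name__ == "__main__":
    # Toy kernel: y[i] = (a[i] * b[i] + c[i]) >> 2, unrolled twice.
    dfg: DFG = {
        0: ("vmul", []), 1: ("vadd", [0]), 2: ("vshr", [1]),
        3: ("vmul", []), 4: ("vadd", [3]), 5: ("vshr", [4]),
    }
    for pattern, count in chain_candidates(dfg).most_common():
        print(pattern, count)   # e.g. ('vmul', 'vadd') occurs twice
```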
Second, in terms of loosely-coupled computing elements, on-chip accelerators are combined with general-purpose cores in an effort to amortize the design cost across many application domains. Programmable accelerators (PAs) have recently been widely investigated in the design of domain-specific architectures to improve system performance and power. Micro-architectures with a series of PAs have been explored to provide more general support for customization. One important feature of PA-rich systems is that the target computational kernels are compiled against a set of pre-defined PA templates and dynamically mapped to physical PAs at runtime. This imposes a demanding challenge on the compiler: how to generate high-quality PA mapping code. We present an efficient PA compilation flow that scales to mapping large computation kernels onto PA-rich architectures and supports fully pipelined execution to achieve the highest energy efficiency. We propose the concept of a maximal PA candidate, which drastically reduces the number of input PA candidates in the mapping phase without affecting the overall mapping optimality. Efficient pre-selection and pruning techniques further speed up the maximal PA mapping process. Our experimental results show that, for 12 computation-intensive standard benchmarks, the proposed approach achieves a significant reduction in compilation time compared to state-of-the-art PA compilation approaches. The average mapping quality is improved by 23.8% and 32.5% for connected and disjoint PA candidates, respectively.
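As a rough illustration of the mapping phase (not the actual algorithm), the sketch below first prunes candidates that are strictly contained in a larger candidate of equal or higher gain, and then greedily selects non-overlapping candidates by estimated gain. The candidate representation, the gain metric, and the greedy selection rule are assumptions made for this example.

```python
# Hypothetical sketch: select a set of non-overlapping PA candidates.
# Each candidate covers a set of dataflow-graph nodes and carries an
# estimated benefit; a simple pre-selection step drops candidates that
# are strictly contained in a larger candidate with at least equal gain.
from typing import FrozenSet, List, Tuple

Candidate = Tuple[FrozenSet[int], float]   # (covered nodes, estimated gain)

def preselect(cands: List[Candidate]) -> List[Candidate]:
    """Prune candidates dominated by a superset with equal or higher gain."""
    kept = []
    for nodes, gain in cands:
        dominated = any(nodes < other and gain <= g for other, g in cands)
        if not dominated:
            kept.append((nodes, gain))
    return kept

def greedy_map(cands: List[Candidate]) -> List[Candidate]:
    """Greedily pick the highest-gain candidates that do not overlap."""
    chosen: List[Candidate] = []
    covered: set = set()
    for nodes, gain in sorted(preselect(cands), key=lambda c: -c[1]):
        if not nodes & covered:
            chosen.append((nodes, gain))
            covered |= nodes
    return chosen

if __name__ == "__main__":
    cands = [(frozenset({0, 1}), 2.0), (frozenset({0, 1, 2}), 3.5),
             (frozenset({3, 4}), 1.5), (frozenset({2, 3}), 1.0)]
    print(greedy_map(cands))   # keeps {0,1,2} and {3,4}
```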
Third, in domain-specific computing, multi-level software-controlled memories (SCMs) are commonly used to exploit domain-specific knowledge of particular applications and achieve high performance and energy efficiency. At the L1 level, while a conventional cache works well for general workloads, some recent work explores a hybrid cache that can be flexibly partitioned into a traditional cache and an SCM. In the hybrid cache architecture, the first-level SCM is used as a prefetch buffer to hide memory access latency. We quantify the impact of data reuse on SCM prefetching efficiency and propose a reuse-aware SCM prefetching (RASP) scheme, which shows a 31.2% performance gain over previous work. On the other hand, SCM has also been widely used as last-level on-board memory to reduce the data movement between computing cores (i.e., host processor and accelerator cores), which is usually carried over a low-bandwidth bus and is known to be one of the major performance bottlenecks in modern heterogeneous systems. To efficiently manage the last-level SCM (LL-SCM), we propose a task-level-reuse-graph (TLRM) based LL-SCM data movement scheme that minimizes the amount of data transferred between heterogeneous computing cores over the slow PCIe bus. With the introduction of the TLRM, the data movement optimization between host and accelerator cores can be approximated using a linear-programming-based solution, and an average 25% reduction in host-accelerator data transfers is observed compared to previous work.
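The following is a simplified sketch, under assumptions not in the original text, of how a task-level reuse graph could be used to estimate avoidable PCIe transfers: a buffer whose producer and next consumer both run on the accelerator stays resident in LL-SCM instead of bouncing through host memory. The task names, the two-device model, and the counting rule are illustrative only and stand in for the actual TLRM formulation.

```python
# Hypothetical sketch: estimate host<->accelerator transfer volume with and
# without task-level reuse.  Each edge of the reuse graph is
# (producer_task, consumer_task, buffer_bytes); each task runs on 'host'
# or 'acc'.  Without reuse, every buffer consumed on the accelerator is
# shipped over PCIe; with reuse, a buffer produced and next consumed on
# the accelerator stays resident in LL-SCM.
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]   # (producer, consumer, bytes)

def transfer_bytes(edges: List[Edge], placement: Dict[str, str],
                   reuse_aware: bool) -> int:
    total = 0
    for producer, consumer, size in edges:
        if placement[consumer] != "acc":
            continue                  # consumed on the host: no PCIe copy-in
        if reuse_aware and placement[producer] == "acc":
            continue                  # buffer stays in LL-SCM, no transfer
        total += size
    return total

if __name__ == "__main__":
    placement = {"t0": "host", "t1": "acc", "t2": "acc", "t3": "host"}
    edges = [("t0", "t1", 4 << 20), ("t1", "t2", 16 << 20), ("t2", "t3", 4 << 20)]
    naive = transfer_bytes(edges, placement, reuse_aware=False)
    reuse = transfer_bytes(edges, placement, reuse_aware=True)
    print(f"naive: {naive >> 20} MiB, reuse-aware: {reuse >> 20} MiB")
```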