Accelerating Data Analytics Kernels with Heterogeneous Computing

Zhong, Guanwen; Prakash, Alok; Mitra, Tulika

doi:10.1007/978-3-319-54840-1_2

Abstract

Heterogeneous computing platforms that combine general-purpose processing elements with different accelerators (such as GPUs or FPGAs) are ideally suited for efficient processing of compute-intensive data analytics kernels. In this chapter, we focus on the acceleration of data analytics kernels on heterogeneous computing systems with FPGAs. The adoption of FPGAs for data analytics is hindered by the difficulty of programming such systems, given the increasing complexity of FPGA-based accelerators. This makes high-level synthesis (HLS) an attractive solution: it improves designer productivity by raising the programming effort above the register-transfer level (RTL). HLS offers various architectural design options with different trade-offs via pragmas (loop unrolling, loop pipelining, array partitioning). However, the non-negligible runtime of HLS renders exhaustive HLS-based architectural exploration, whether manual or automated, practically infeasible for implementing the kernels. To address this challenge, we have developed Lin-Analyzer, an accurate high-level performance analysis tool that enables rapid design space exploration with various pragmas for FPGA-based accelerators without requiring RTL implementations. We show how Lin-Analyzer enables easy yet performance-efficient implementation of computational kernels from a variety of data analytics applications on FPGA-based heterogeneous systems.
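To make the exploration cost concrete, the following is a minimal Python sketch of how the number of pragma combinations grows and how long an exhaustive sweep would take. It is not Lin-Analyzer's method. The option lists, the number of arrays, and the per-run HLS time are all hypothetical values chosen for illustration, and the function names `enumerate_design_space` and `exhaustive_hours` are invented for this sketch.

```python
from itertools import product

# Hypothetical pragma options for a single loop nest. These values are
# illustrative assumptions, not taken from the chapter.
UNROLL_FACTORS = [1, 2, 4, 8, 16]      # loop unrolling factor
PIPELINE_OPTIONS = [False, True]       # loop pipelining off/on
PARTITION_FACTORS = [1, 2, 4, 8, 16]   # array partitioning factor per array


def enumerate_design_space(num_arrays):
    """Return every (unroll, pipelined, per-array partition) combination."""
    partition_choices = product(PARTITION_FACTORS, repeat=num_arrays)
    return list(product(UNROLL_FACTORS, PIPELINE_OPTIONS, partition_choices))


def exhaustive_hours(num_points, minutes_per_run):
    """Hours needed to run HLS once for every design point."""
    return num_points * minutes_per_run / 60.0


if __name__ == "__main__":
    # A kernel with three arrays: 5 * 2 * 5**3 = 1250 design points.
    space = enumerate_design_space(num_arrays=3)
    print("design points:", len(space))
    # Assumed (hypothetical) HLS runtime of 10 minutes per design point.
    print("hours for exhaustive HLS:", exhaustive_hours(len(space), 10))
```

Even with these small option lists, the count grows exponentially with the number of arrays, so the cost of a full HLS run per design point dominates. This is the motivation for estimating performance without generating an RTL implementation.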

Alok completed this project while working at SoC, NUS

Acknowledgements

This work was partially supported by the Singapore Ministry of Education Academic Research Fund Tier 2 MOE2015-T2-2-088.

Author information

Authors and Affiliations

School of Computing, National University of Singapore, Singapore, Singapore
Guanwen Zhong & Tulika Mitra
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Alok Prakash

Corresponding author

Correspondence to Tulika Mitra.

Editor information

Editors and Affiliations

School of Computer Science and Engineering, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
Anupam Chattopadhyay
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Chip Hong Chang
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Hao Yu

About this chapter

Cite this chapter

Zhong, G., Prakash, A., Mitra, T. (2017). Accelerating Data Analytics Kernels with Heterogeneous Computing. In: Chattopadhyay, A., Chang, C., Yu, H. (eds) Emerging Technology and Architecture for Big-data Analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-54840-1_2

DOI: https://doi.org/10.1007/978-3-319-54840-1_2
Published: 21 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54839-5
Online ISBN: 978-3-319-54840-1
eBook Packages: Engineering, Engineering (R0)
