Accelerating Data Analytics Kernels with Heterogeneous Computing

  • Chapter
Emerging Technology and Architecture for Big-data Analytics

Abstract

Heterogeneous computing platforms that combine general-purpose processing elements with accelerators (such as GPUs or FPGAs) are well suited to efficient processing of compute-intensive data analytics kernels. In this chapter, we focus on the acceleration of data analytics kernels on heterogeneous computing systems with FPGAs. The adoption of FPGAs for data analytics is hindered by the difficulty of programming such systems, given the increasing complexity of FPGA-based accelerators. This makes high-level synthesis (HLS) an attractive solution: it improves designer productivity by raising the programming abstraction above the register-transfer level (RTL). HLS offers various architectural design options with different trade-offs via pragmas (loop unrolling, loop pipelining, array partitioning). However, the non-negligible runtime of HLS renders exhaustive architectural exploration of kernel implementations, whether manual or automated, practically infeasible. To address this challenge, we have developed Lin-Analyzer, an accurate high-level performance analysis tool that enables rapid design space exploration with various pragmas for FPGA-based accelerators without requiring RTL implementations. We show how Lin-Analyzer enables easy yet performance-efficient implementation of computational kernels from a variety of data analytics applications on FPGA-based heterogeneous systems.
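
As an illustration of the three pragma families named in the abstract, the following is a minimal sketch of a matrix-vector kernel annotated in the Vivado HLS pragma style. The kernel, the names `matvec` and `check_matvec`, the size `N`, and the factor of 4 are assumptions chosen for illustration and are not taken from the chapter. A standard C++ compiler ignores the unknown pragmas (possibly with a warning), so the same source also runs as ordinary software.

```cpp
#include <cassert>

// Hypothetical problem size for illustration only.
constexpr int N = 16;
static_assert(N % 4 == 0, "N should be a multiple of the unroll/partition factor (4)");

// Matrix-vector product y = A * x with HLS-style pragmas.
// Each pragma selects one architectural option; changing a factor
// yields a different design point with a different area/latency trade-off.
void matvec(const int A[N][N], const int x[N], int y[N]) {
// Array partitioning: split the arrays across 4 memory banks so that
// 4 elements can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=A cyclic factor=4 dim=2
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=4 dim=1
    for (int i = 0; i < N; ++i) {
        int acc = 0;
        for (int j = 0; j < N; ++j) {
// Loop pipelining: request a new iteration every cycle (initiation interval 1).
#pragma HLS PIPELINE II=1
// Loop unrolling: replicate the loop body 4 times.
#pragma HLS UNROLL factor=4
            acc += A[i][j] * x[j];
        }
        y[i] = acc;
    }
}

// Software-only sanity check: with A = 2 * identity, y must equal 2 * x.
void check_matvec() {
    int A[N][N] = {};
    int x[N];
    int y[N];
    for (int i = 0; i < N; ++i) {
        A[i][i] = 2;
        x[i] = i;
    }
    matvec(A, x, y);
    for (int i = 0; i < N; ++i) {
        assert(y[i] == 2 * i);
    }
}
```

Even in this small sketch, each choice of unroll factor, partition factor, and pipelining on or off is a separate design point. Evaluating every combination through a full HLS run is what the abstract describes as practically infeasible, which motivates estimating performance without generating RTL.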

Alok completed this project while working at SoC, NUS

Acknowledgements

This work was partially supported by the Singapore Ministry of Education Academic Research Fund Tier 2 MOE2015-T2-2-088.

Author information

Corresponding author

Correspondence to Tulika Mitra.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Zhong, G., Prakash, A., Mitra, T. (2017). Accelerating Data Analytics Kernels with Heterogeneous Computing. In: Chattopadhyay, A., Chang, C., Yu, H. (eds) Emerging Technology and Architecture for Big-data Analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-54840-1_2

  • DOI: https://doi.org/10.1007/978-3-319-54840-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54839-5

  • Online ISBN: 978-3-319-54840-1

  • eBook Packages: Engineering, Engineering (R0)
