Optimizing FPGA-based CNN accelerator for energy efficiency with an extended Roofline model

In recent years, the convolutional neural network (CNN) has found wide acceptance in solving practical computer vision and image recognition problems. Also recently, due to its flexibility, short development time, and energy efficiency, the field-programmable gate array (FPGA) has become an attractive platform for exploiting the inherent parallelism in the feedforward process of the CNN. However, to meet the accuracy demands of today's practical recognition applications, which typically involve massive datasets, CNNs have to become larger and deeper. Enlarging the CNN aggravates the off-chip memory bottleneck on the FPGA platform, since there is not enough on-chip space to hold large datasets. In this work, we propose a memory system architecture that best matches the off-chip memory traffic to the optimum throughput of the computation engine while operating at the maximum allowable frequency. With the help of the extended Roofline model proposed in this work, we can estimate the memory bandwidth utilization of the system at different operating frequencies, since the proposed model considers operating frequency in addition to bandwidth utilization and throughput. To find the solution with the best energy efficiency, we trade off energy efficiency against computational throughput. This solution saves 18% of energy utilization at the cost of less than a 2% reduction in throughput. We also propose a race-to-halt strategy to further improve the energy efficiency of the designed CNN accelerator. Experimental results show that our CNN accelerator achieves a peak performance of 52.11 GFLOPS and an energy efficiency of 10.02 GFLOPS/W on a ZYNQ ZC706 FPGA board running at 250 MHz, which outperforms most previous approaches.
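To make the frequency dependence concrete, the sketch below writes the classic Roofline bound with an explicit operating-frequency term, which is the basic idea behind the extension described above. It is only an illustration under assumed numbers, not the exact extended model developed in the paper; the function name, parameter names, and example values are all hypothetical.

```python
# Minimal sketch of a frequency-aware Roofline bound (illustrative only;
# not the authors' exact extended model). All names and numbers are hypothetical.

def attainable_gflops(ops_per_cycle, freq_ghz, ctc_ratio, mem_bw_gbs):
    """Attainable throughput (GFLOPS) of a compute engine.

    ops_per_cycle : floating-point operations the engine completes per cycle
    freq_ghz      : operating frequency of the engine (GHz)
    ctc_ratio     : computation-to-communication ratio (FLOP per byte of off-chip traffic)
    mem_bw_gbs    : available off-chip memory bandwidth (GB/s)
    """
    compute_roof = ops_per_cycle * freq_ghz   # bound set by the computation engine
    memory_roof = ctc_ratio * mem_bw_gbs      # bound set by off-chip memory traffic
    return min(compute_roof, memory_roof)

# Example: a hypothetical engine doing 256 FLOP/cycle at 250 MHz, with a
# CTC ratio of 8 FLOP/byte and 4.2 GB/s of effective DDR bandwidth.
print(attainable_gflops(256, 0.25, 8.0, 4.2))   # min(64.0, 33.6) -> 33.6 GFLOPS
```

In this toy configuration the design is memory-bound, so raising the frequency alone would not raise attainable throughput; sweeping `freq_ghz` in such a model is one way to expose the throughput-versus-energy trade-off the abstract refers to.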
