A FAST AND ENERGY EFFICIENT PARALLEL IMAGE FILTERING IMPLEMENTATION ON RASPBERRY PI'S GPU

This paper presents a fast and energy-efficient image filtering technique aimed at energy- and time-sensitive embedded and robotic platforms. Digital video processing is increasingly common in battery-powered devices such as mobile robots and smartphones, but in most cases it places a heavy load on the main central processing unit (CPU) and draws a significant amount of energy from the battery. Image filtering is well suited to parallel execution because there is no data dependency between the steps of the two-dimensional convolution algorithm. We propose a vector version of the two-dimensional convolution algorithm that runs in parallel on embedded processors with a general-purpose graphics processing unit (GPGPU), in order to reduce computation time and energy consumption. Our in-depth experiments show that using the GPGPU reduces execution time while keeping power consumption lower and offloading the system CPU. Compared to the CPU implementation, we achieved up to 105 times faster operation and 100 times less energy consumption. In addition, we reduced the CPU overhead by up to 10 times.
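The observation that the steps of two-dimensional convolution carry no data dependency can be illustrated with a small sketch. The code below is not the implementation described in the paper: it is plain Python, the function names are hypothetical, and the lane width of 16 is an assumption chosen only to mimic a SIMD-style vector unit. It compares a scalar loop with a version that updates a block of neighbouring output pixels with the same multiply-accumulate step, which is the general idea behind a vector formulation.

```python
LANES = 16  # assumed vector width, for illustration only


def convolve2d_scalar(image, kernel):
    """Scalar 2D filtering over the valid region.

    The kernel is applied without flipping (correlation form), which
    equals true convolution for symmetric kernels.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[y + i][x + j] * kernel[i][j]
            out[y][x] = acc  # depends only on the input, never on other outputs
    return out


def convolve2d_vector(image, kernel, lanes=LANES):
    """Same result, computed `lanes` output pixels at a time.

    Each lane holds an independent accumulator, so every
    multiply-accumulate step applies one kernel coefficient to a whole
    block of neighbouring pixels at once.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x0 in range(0, out_w, lanes):
            n = min(lanes, out_w - x0)  # the last block may be partial
            acc = [0] * n
            for i in range(kh):
                for j in range(kw):
                    coeff = kernel[i][j]
                    src = image[y + i][x0 + j:x0 + j + n]
                    acc = [a + coeff * s for a, s in zip(acc, src)]
            row.extend(acc)
        out.append(row)
    return out


if __name__ == "__main__":
    # Small synthetic image and a 3x3 box kernel (illustrative values).
    img = [[(r * 7 + c * 3) % 11 for c in range(40)] for r in range(6)]
    box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    assert convolve2d_scalar(img, box) == convolve2d_vector(img, box)
```

Because no output pixel reads another output pixel, the blocks in the vector version could be processed in any order or on separate processing elements without changing the result.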

Keywords:

General Purpose Graphic Processing Unit (GPGPU), parallelism, parallel processing, vector processing, speedup
