Learning multiview deep features from skeletal sign language videos for recognition

A long-standing challenge in machine translation of sign language is the machine's inability to learn inter-occluding finger movements during a signing action. This work addresses the problem of teaching a deep learning model to recognize differently oriented skeletal data. The multiview 2D skeletal sign language video data were obtained from a 3D motion-capture system. A total of 9 signer views were used for training the proposed network and 6 for testing and validation. To obtain multiview deep features for recognition, we propose an end-to-end trainable multistream convolutional neural network (CNN) with late feature fusion. The fused multiview features are then fed to two dense layers and a decision-making softmax layer. The proposed CNN employs multiple layers to characterize view correspondence and generate maximally discriminative features. This study is important for understanding how CNNs process multiview data for sign language recognition by decoding joint spatial information. Furthermore, it develops deeper insight into multiview processing in CNNs through its application to skeletal action data.
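The late-fusion pipeline described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the per-stream CNN is replaced by a random-feature stub, and the feature dimension (128), hidden width (256), and class count (10) are assumed for illustration; only the number of training views (9) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_VIEWS = 9     # signer views used for training, as in the paper
FEAT_DIM = 128    # per-stream feature size (assumed)
NUM_CLASSES = 10  # illustrative number of sign classes (assumed)

def view_stream_features(view_frames, dim=FEAT_DIM):
    """Stand-in for one CNN stream: maps one view's frames to a feature vector.
    A real stream would be a trained convolutional network."""
    return rng.standard_normal(dim)

def softmax(z):
    # Numerically stable softmax for the decision layer.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Late fusion: each view passes through its own stream; the resulting
# feature vectors are concatenated only after the streams finish.
per_view = [view_stream_features(None) for _ in range(NUM_VIEWS)]
fused = np.concatenate(per_view)  # shape: (NUM_VIEWS * FEAT_DIM,)

# Two dense layers followed by a softmax decision layer.
W1 = rng.standard_normal((256, fused.size)) * 0.01
W2 = rng.standard_normal((NUM_CLASSES, 256)) * 0.01
hidden = np.maximum(0.0, W1 @ fused)  # ReLU activation
probs = softmax(W2 @ hidden)          # class probabilities

print(fused.shape, probs.shape)
```

The design choice illustrated here is that fusion happens at the feature level (late fusion) rather than at the input level, so each stream can specialize in one view before the dense layers learn cross-view correspondence.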

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK