Human activity recognition using MHIs of frame sequences

A motion history image (MHI) is a temporal template that collapses the motion of a frame sequence into a single image in which pixel intensity is a function of the recency of motion. In recent years, the success of deep learning architectures for human activity recognition has encouraged us to explore the effectiveness of combining them with MHIs. On this basis, two new methods are introduced in this paper. In the first, called the basic method, each video is split into N groups of consecutive frames and an MHI is computed for each group. Transfer learning with fine-tuning is used to classify these temporal templates. The experimental results show that similarities between the temporal templates cause some misclassification errors; these errors can be corrected by detecting specific objects in the scenes. The second, called the proposed method, therefore also incorporates spatial information in the form of a single frame. By converting the video classification problem into an image classification problem, the proposed method requires less memory and greatly reduces time complexity. Both methods are implemented and compared with state-of-the-art approaches on two data sets. The results show that the proposed method significantly outperforms the others, achieving recognition accuracies of 92% and 92.4% on the UCF Sports and UCF-11 action data sets, respectively.
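The classic MHI update rule (Bobick and Davis) can be sketched as follows: pixels where motion is detected are set to a duration value τ, and all other pixels decay toward zero, so brighter intensities mark more recent motion. This is a minimal NumPy illustration using simple frame differencing; the function name, the linear decay step, and the threshold values are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=30, threshold=25):
    """One MHI update step: pixels with motion (frame difference above
    threshold) are set to tau; all other pixels decay by one, clamped at 0."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    motion = diff >= threshold
    return np.where(motion, tau, np.maximum(mhi - 1, 0))

# Synthetic example: a bright square moving right across a dark background.
frames = []
for t in range(5):
    f = np.zeros((32, 32), dtype=np.uint8)
    f[10:20, 2 + 4 * t : 12 + 4 * t] = 255
    frames.append(f)

mhi = np.zeros((32, 32), dtype=np.int16)
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur)

# The most recent motion sits at intensity tau; earlier motion has decayed,
# and pixels that never moved remain at zero.
print(mhi.max())  # 30
```

The resulting single-channel image is what each group of N consecutive frames is reduced to before being fed to the fine-tuned classifier, which is what turns the video classification problem into an image classification problem.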
