Deep temporal motion descriptor (DTMD) for human action recognition

Spatiotemporal features are important for human action recognition because they capture the actor's shape and motion characteristics specific to each action class. This paper presents a new deep spatiotemporal human action representation, the deep temporal motion descriptor (DTMD), which combines the attributes of holistic and deep-learned features. To generate DTMD, the actor's silhouettes are aggregated into a single motion template using motion history images. These motion templates capture the actor's spatiotemporal movements and compactly represent a human action as a single 2D template. Deep convolutional neural networks are then used to compute discriminative deep features from the motion history templates, producing the DTMD. Finally, DTMD is used to learn a model that recognizes human actions with a softmax classifier. DTMD has three advantages: it is learned automatically from videos and provides a higher-dimensional, more discriminative spatiotemporal representation than handcrafted features; it reduces the computational complexity of action recognition, since all the frames of a video are compactly represented by a single motion template; and it works effectively for both single-view and multiview action recognition. We conducted experiments on three challenging datasets: MuHAVi-Uncut, IXMAS, and IAVID-1. The experimental findings show that DTMD outperforms previous methods and achieves the highest action prediction rate on the MuHAVi-Uncut dataset.
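
As a rough illustration of the pipeline described above, the sketch below builds a motion history image from binary silhouette masks and extracts a deep descriptor from the resulting 2D template. It assumes NumPy and PyTorch/torchvision; the function names (motion_history_image, dtmd_descriptor) and the choice of AlexNet as the backbone are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

def motion_history_image(silhouettes, tau=None):
    """Collapse a sequence of binary silhouette masks (H, W) into a
    single 2D motion history image (a temporal template in the sense
    of Bobick and Davis)."""
    frames = list(silhouettes)
    tau = float(tau if tau is not None else len(frames))
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for mask in frames:
        moving = mask.astype(bool)
        # Pixels inside the current silhouette are reset to tau;
        # everywhere else the history decays by one step, floored at 0.
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # normalize to [0, 1] before feeding the CNN

# Illustrative deep-feature stage: a pretrained AlexNet truncated before
# its final classification layer (the abstract does not name the exact
# architecture, so this backbone is an assumption).
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn.classifier = torch.nn.Sequential(*list(cnn.classifier.children())[:-1])
cnn.eval()
normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])

def dtmd_descriptor(mhi):
    """Map a single MHI template to a deep feature vector."""
    x = torch.from_numpy(mhi).unsqueeze(0).repeat(3, 1, 1)  # grey -> 3 ch
    x = normalize(T.functional.resize(x, [224, 224])).unsqueeze(0)
    with torch.no_grad():
        return cnn(x).squeeze(0)  # e.g., a 4096-D descriptor
```

A softmax (linear) classifier trained on these per-video descriptors would then complete the recognition pipeline outlined in the abstract.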
