Lip Reading Using Convolutional Neural Networks with and without Pre-Trained Models

Lip reading has recently become a popular research topic, with a widespread literature on lip reading within human action recognition, and deep learning methods are frequently used in this area. In this paper, lip reading from video data is performed using self-designed convolutional neural networks (CNNs). For this purpose, both the standard and an augmented AvLetters dataset are used in the training and testing stages. To optimize network performance, the mini-batch size parameter is also tuned and its effect is investigated. Additionally, experimental studies are performed using the AlexNet and GoogLeNet pre-trained CNNs. Detailed experimental results are presented.
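As a concrete illustration of the two settings described above (a small self-designed CNN versus a fine-tuned pre-trained network, with the mini-batch size as the tuned parameter), the following is a minimal sketch in PyTorch, an assumed framework since the paper does not specify its implementation. The layer sizes, batch sizes, optimizer settings, and the random tensors standing in for AvLetters mouth-region frames are illustrative assumptions, not the authors' actual configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

NUM_CLASSES = 26  # AvLetters covers the isolated letters A-Z


def make_small_cnn() -> nn.Module:
    # A small "self-designed" CNN in the spirit of the abstract;
    # the layer sizes are illustrative, not the authors' architecture.
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                      # 224x224 -> 112x112
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                      # 112x112 -> 56x56
        nn.Flatten(),
        nn.Linear(32 * 56 * 56, NUM_CLASSES),
    )


def make_pretrained_alexnet() -> nn.Module:
    # ImageNet-pre-trained AlexNet (downloads weights on first use)
    # with its last fully connected layer replaced for 26 classes.
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(net.classifier[6].in_features, NUM_CLASSES)
    return net


# Random tensors stand in for AvLetters mouth-region frames,
# resized to AlexNet's expected 224x224 RGB input.
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (64,))
dataset = TensorDataset(images, labels)


def train_one_epoch(model: nn.Module, batch_size: int) -> float:
    # One training epoch; the mini-batch size studied in the paper
    # enters as the DataLoader's batch_size argument.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    model.train()
    total = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item() * x.size(0)
    return total / len(dataset)


# Sweep the mini-batch size for both settings, echoing the
# tuning experiments described in the abstract.
for build in (make_small_cnn, make_pretrained_alexnet):
    for bs in (16, 32, 64):
        print(build.__name__, bs, train_one_epoch(build(), bs))
```

A GoogLeNet variant would follow the same fine-tuning pattern, replacing the network's final fully connected layer (`model.fc` in torchvision's implementation) rather than `classifier[6]`.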
