Deep Q-network-based noise suppression for robust speech recognition

This study develops a deep Q-network (DQN)-based noise suppression method for robust speech recognition under ambient noise. We design a reinforcement learning (RL) algorithm that combines DQN training with a deep neural network (DNN), so that RL can operate in complex, high-dimensional environments such as speech recognition. Specifically, the DQN is trained to choose the best action, namely a quantized noise suppression gain, from observations of the noisy speech signal, using a reward that incorporates both the word error rate (WER) and an objective speech quality measure. Experiments demonstrate that the proposed algorithm improves speech recognition in various noisy conditions while reducing the computational burden compared to a DNN-based noise suppression method.
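The setup described above can be sketched as a Q-learning loop in which the agent observes features of a noisy speech frame, selects one of a small codebook of quantized suppression gains, and receives a reward. The sketch below is illustrative only: the feature dimension, gain codebook, and toy reward are assumptions (the paper's actual reward combines WER and an objective quality score, and its Q-function is a deep network rather than the linear approximation used here for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)

GAINS = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # quantized suppression gains (the actions)
N_FEATS = 8                                    # per-frame spectral features (assumed size)
W = np.zeros((len(GAINS), N_FEATS))            # linear Q-function stands in for the DQN
ALPHA, GAMMA, EPS = 0.05, 0.9, 0.2             # step size, discount, exploration rate

def q_values(feats):
    """Q(s, a) for every quantized gain, given frame features s."""
    return W @ feats

def select_gain(feats):
    """Epsilon-greedy action selection over the gain codebook."""
    if rng.random() < EPS:
        return int(rng.integers(len(GAINS)))
    return int(np.argmax(q_values(feats)))

def toy_reward(feats, action):
    """Placeholder for the paper's reward (negative WER plus a
    quality term); here it simply favors gains matched to a crude
    SNR proxy so the loop has something to learn."""
    snr_proxy = feats.mean()
    return -abs(GAINS[action] - 1.0 / (1.0 + np.exp(-snr_proxy)))

# One pass of Q-learning over synthetic frames. Experience replay and the
# target network used by full DQN training are omitted from this sketch.
for _ in range(500):
    s = rng.normal(size=N_FEATS)               # observed noisy-speech features
    a = select_gain(s)
    r = toy_reward(s, a)
    s_next = rng.normal(size=N_FEATS)
    td_target = r + GAMMA * np.max(q_values(s_next))
    td_error = td_target - q_values(s)[a]
    W[a] += ALPHA * td_error * s               # gradient step for the linear Q-function
```

In the paper's formulation, a trained DQN replaces the linear weights `W`, and applying the selected gain to the noisy spectrum yields the enhanced signal passed to the recognizer.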
