A new method for extraction of speech features using spectral delta characteristics and invariant integration

We propose a new feature extraction algorithm that is robust against noise, based on nonlinear filtering and temporal masking. Whereas current automatic speech recognition systems use invariant-integration and delta-delta techniques for speech feature extraction, the proposed algorithm improves recognition accuracy by using a delta-spectral feature in place of invariant integration. One nonenvironmental factor that reduces recognition accuracy is vocal tract length (VTL), which leads to a mismatch between training and testing data; the invariant-integration idea can be used to reduce these VTL effects. The aim of this paper is to provide robust features that yield improvements under different noise conditions while also remaining robust against VTL changes. The result is a larger improvement in recognition accuracy than mel-frequency cepstral coefficients and perceptual linear prediction achieve in the presence of different types of noise and scenarios.
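As an illustration of the delta-spectral idea mentioned above, the sketch below computes standard regression-based deltas directly on per-frame spectral values rather than on cepstra. The function name, window width, and edge handling are illustrative assumptions, not the paper's exact formulation.

```python
def spectral_delta(frames, N=2):
    """Regression-based delta over a sequence of per-frame feature
    vectors. Applied to (log-)spectral values instead of cepstra,
    this yields delta-spectral features (illustrative sketch only).
    frames: list of equal-length lists of floats; N: half-window."""
    T = len(frames)
    D = len(frames[0])
    norm = 2.0 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        vec = []
        for d in range(D):
            acc = 0.0
            for n in range(1, N + 1):
                # clamp indices at the sequence edges
                fwd = frames[min(t + n, T - 1)][d]
                bwd = frames[max(t - n, 0)][d]
                acc += n * (fwd - bwd)
            vec.append(acc / norm)
        out.append(vec)
    return out
```

On a linearly increasing spectral trajectory the interior deltas recover the constant slope, while the clamped edges attenuate toward zero, which is the usual behavior of this estimator.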

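The invariant-integration idea referred to in the abstract can be sketched as averaging a fixed monomial of subband values over translations along the (log-)frequency axis, since a vocal tract length change acts approximately as such a shift. The names, parameters, and monomial structure below are illustrative assumptions, not the cited papers' exact definitions.

```python
def invariant_integration_feature(v, offsets, exponents, W):
    """One invariant-integration-style feature: average a monomial of
    subband magnitudes over W frequency translations, so a shift along
    the frequency axis (the approximate effect of a VTL change) leaves
    the feature roughly unchanged. Illustrative sketch only.
    v: list of subband values; offsets/exponents: monomial definition;
    W: number of translations averaged (requires max(offsets)+W <= len(v))."""
    acc = 0.0
    for w in range(W):
        prod = 1.0
        for k, m in zip(offsets, exponents):
            prod *= v[w + k] ** m
        acc += prod
    return acc / W
```

For a spectrum that is constant across the averaging window, every translated monomial is identical, so the average equals the monomial itself, which is the shift-invariance property being exploited.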
___

  • B. Atal, “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification”, Journal of the Acoustical Society of America, Vol. 55, pp. 1304–1312, 1974.
  • P. Jain, H. Hermansky, “Improved mean and variance normalization for robust speech recognition”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001.
  • X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Upper Saddle River, NJ, USA, Prentice Hall, 2001.
  • Y. Obuchi, N. Hataoka, R.M. Stern, “Normalization of time-derivative parameters for robust speech recognition in small devices”, IEICE Transactions on Information and Systems, Vol. 87, pp. 1004–1011, 2004.
  • P.J. Moreno, B. Raj, R.M. Stern, “A vector Taylor series approach for environment-independent speech recognition”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 733–736, 1996.
  • R.M. Stern, B. Raj, P.J. Moreno, “Compensation for environmental degradation in automatic speech recognition”, Proceedings of the ESCA Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 33–42, 1997.
  • C. Kim, R.M. Stern, “Power function-based power distribution normalization algorithm for robust speech recognition”, IEEE Automatic Speech Recognition and Understanding Workshop, pp. 188–193, 2009.
  • B. Raj, V.N. Parikh, R.M. Stern, “The effects of background music on speech recognition accuracy”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 851–854, 1997.
  • B. Raj, R.M. Stern, “Missing-feature methods for robust automatic speech recognition”, IEEE Signal Processing Magazine, Vol. 22, pp. 101–116, 2005.
  • H. Hermansky, “Perceptual linear prediction analysis of speech”, Journal of the Acoustical Society of America, Vol. 87, pp. 1738–1752, 1990.
  • C. Kim, H. Chiu, R.M. Stern, “Physiologically-motivated synchrony-based processing for robust automatic speech recognition”, InterSpeech, pp. 1975–1978, 2006.
  • K. Kumar, “A spectro-temporal framework for compensation of reverberation for speech recognition”, PhD, Carnegie Mellon University, Pittsburgh, PA, USA, 2011.
  • H. Hermansky, N. Morgan, “RASTA processing of speech”, IEEE Transactions on Speech and Audio Processing, Vol. 2, pp. 578–589, 1994.
  • L. Deng, A. Acero, M. Plumpe, X. Huang, “Large-vocabulary speech recognition under adverse acoustic environments”, Proceedings of the International Conference on Spoken Language Processing, pp. 806–809, 2000.
  • M.J.F. Gales, “Model-based techniques for noise robust speech recognition”, PhD, Cambridge University, Cambridge, UK, 1995.
  • C. Kim, R.M. Stern, “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring”, Proceedings of the International Conference on Audio, Speech, and Signal Processing, pp. 4574–4577, 2010.
  • F. Muller, A. Mertins, “Contextual invariant-integration features for improved speaker-independent speech recognition”, Speech Communication, Vol. 53, pp. 830–841, 2011.
  • B.E.D. Kingsbury, N. Morgan, S. Greenberg, “Robust speech recognition using the modulation spectrogram”, Speech Communication, Vol. 25, pp. 117–132, 1998.
  • H.G. Hirsch, C. Ehrlicher, “Noise estimation techniques for robust speech recognition”, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 153–156, 1995.
  • C. Kim, R.M. Stern, “Nonlinear enhancement of onset for robust speech recognition”, InterSpeech, pp. 2058–2061, 2010.
  • S.F. Boll, “Suppression of acoustic noise in speech using spectral subtraction”, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 27, pp. 113–120, 1979.
  • C. Lemyre, M. Jelinek, R. Lefebvre, “New approach to voiced onset detection in speech signal and its application for frame error concealment”, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4757–4760, 2008.
  • S.R.M. Prasanna, P. Krishnamoorthy, “Vowel onset point detection using source, spectral peaks, and modulation spectrum energies”, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, pp. 556–565, 2009.
  • F. Müller, A. Mertins, “Noise robust speaker-independent speech recognition with invariant-integration features using power-bias subtraction”, Speech Communication, Vol. 53, pp. 830–841, 2011.
  • S. Furui, “Speaker-independent isolated word recognition based on emphasized spectral dynamics”, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1986.
  • T. Gramss, “Word recognition with the feature finding neural network (FFNN)”, Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, pp. 289–298, 1991.
  • M. Bijankhan, J. Sheikhzadegan, “FARSDAT – The speech database of Farsi spoken language”, Proceedings of the 5th Australian International Conference on Speech Science and Technology, Vol. 2, pp. 826–831, 1994.