A unified approach to speech enhancement and voice activity detection

This paper proposes a unified system for voice activity detection (VAD) and speech enhancement in which the VAD and speech enhancement blocks exchange information with each other. A new, robust VAD algorithm is implemented for the VAD block; it uses a periodicity measure and an energy measure obtained from the spectral energy distribution and spectral energy difference of the input speech data. The speech enhancement block uses the modified Wiener filtering (MWF) approach. It is shown that the information exchange between the VAD and MWF algorithms improves the performance of both algorithms, and that the proposed unified system significantly improves the robustness of a speech recognition system. Both of the enhanced algorithms are noniterative, so the proposed unified system is computationally attractive for real-time applications.
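The mutual exchange of information described above can be illustrated with a minimal sketch. It is not the algorithm of this paper: the abstract does not give the VAD measures or the MWF gain in detail, so the sketch assumes a frame-level SNR as the energy measure, a normalized autocorrelation peak as the periodicity measure, and a plain spectral-subtraction Wiener gain in place of MWF. All function names, parameters, and thresholds (`periodicity`, `unified_vad_enhance`, `snr_thresh_db`, `per_thresh`, `alpha`, `gain_floor`) are hypothetical. The only point it makes is the coupling: the VAD decision gates the noise update used by the enhancement block, and that noise estimate in turn drives the VAD energy measure, in a single noniterative pass.

```python
import numpy as np


def periodicity(frame, fs, f_lo=60.0, f_hi=400.0):
    """Peak of the normalized autocorrelation over plausible pitch lags."""
    x = frame - frame.mean()
    energy = float(np.dot(x, x))
    if energy <= 1e-12:
        return 0.0
    # Keep non-negative lags only: ac[k] = sum_n x[n] * x[n + k].
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo = int(fs / f_hi)
    hi = min(int(fs / f_lo), len(x) - 1)
    return float(ac[lo:hi + 1].max() / energy)


def unified_vad_enhance(signal, fs, frame_len=256, init_noise_frames=5,
                        snr_thresh_db=6.0, per_thresh=0.5,
                        alpha=0.9, gain_floor=0.1):
    """Single-pass VAD plus Wiener-type enhancement with a shared noise estimate.

    Simplifications (assumptions of this sketch): rectangular,
    non-overlapping frames; the first `init_noise_frames` frames are
    assumed to contain noise only; unvoiced speech is not handled.
    """
    eps = 1e-12
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)

    # Initial noise power spectrum from the leading frames.
    init_spec = np.fft.rfft(frames[:init_noise_frames], axis=1)
    noise_psd = np.mean(np.abs(init_spec) ** 2, axis=0)

    decisions = np.zeros(n_frames, dtype=bool)
    enhanced = np.zeros_like(frames)

    for i, frame in enumerate(frames):
        spec = np.fft.rfft(frame)
        psd = np.abs(spec) ** 2

        # Enhancement -> VAD: the energy measure is relative to the
        # noise spectrum tracked for the Wiener gain.
        snr_db = 10.0 * np.log10(psd.sum() / (noise_psd.sum() + eps) + eps)
        is_speech = bool(snr_db > snr_thresh_db
                         and periodicity(frame, fs) > per_thresh)
        decisions[i] = is_speech

        # VAD -> enhancement: update the noise spectrum only in frames
        # that the VAD labels as non-speech.
        if not is_speech:
            noise_psd = alpha * noise_psd + (1.0 - alpha) * psd

        # Wiener-type gain from a spectral-subtraction SNR estimate,
        # floored to limit musical noise.
        gain = np.maximum(1.0 - noise_psd / (psd + eps), gain_floor)
        enhanced[i] = np.fft.irfft(gain * spec, n=frame_len)

    return decisions, enhanced.reshape(-1)
```

A practical version would use overlapping windowed frames with overlap-add resynthesis and would treat unvoiced speech separately; both are omitted here to keep the coupling between the two blocks visible.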
