Scale-invariant MFCCs for speech/speaker recognition

Feature extraction is a fundamental part of speech processing. Mel frequency cepstral coefficients (MFCCs) are the most commonly used features in the speech/speaker recognition literature. However, the MFCC framework may face numerical issues or dynamic range problems, which decrease its performance. A practical solution to these problems is to add a constant to the filter-bank magnitudes before log compression, but this violates the scale-invariance property. In this work, a magnitude normalization and a multiplication constant are introduced to make the MFCCs scale-invariant and to avoid dynamic range expansion of nonspeech frames. Speaker verification experiments are conducted to show the effectiveness of the proposed scheme.
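The scale-invariance issue described above can be illustrated with a small numerical sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the normalization by the utterance-level mean filter-bank magnitude, the function names, and the `gain` and `offset` values are hypothetical and are not taken from the paper. The sketch compares log filter-bank outputs of a signal and of the same signal scaled by 4. Since the DCT that follows log compression is linear, any invariance (or lack of it) at this stage carries over to the cepstral coefficients.

```python
import math
import random


def log_fbank_offset(fbank, offset=1.0):
    """Common workaround: log(E + offset). The offset breaks scale invariance."""
    return [[math.log(e + offset) for e in frame] for frame in fbank]


def log_fbank_normalized(fbank, gain=100.0, offset=1.0):
    """Hypothetical scale-invariant variant (an assumption, not the paper's formula):
    divide by the utterance-level mean magnitude, multiply by a constant gain,
    then add the offset before log compression."""
    total = sum(sum(frame) for frame in fbank)
    count = sum(len(frame) for frame in fbank)
    mean = total / count
    return [[math.log(gain * e / mean + offset) for e in frame] for frame in fbank]


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two frame lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Synthetic filter-bank magnitudes: 5 frames x 8 bands (stand-in for real data).
random.seed(0)
fbank = [[random.uniform(0.01, 10.0) for _ in range(8)] for _ in range(5)]
scaled = [[4.0 * e for e in frame] for frame in fbank]  # same signal, 4x louder

diff_offset = max_abs_diff(log_fbank_offset(fbank), log_fbank_offset(scaled))
diff_norm = max_abs_diff(log_fbank_normalized(fbank), log_fbank_normalized(scaled))

print("log(E + c) changes under scaling by up to:", diff_offset)
print("normalized variant changes by up to:", diff_norm)
```

In this sketch the plain offset version changes with the input level, while the normalized version does not, because the scale factor cancels in the ratio of each magnitude to the utterance mean. The `gain` value here only sets how large the normalized magnitudes are relative to the offset; how the paper chooses its multiplication constant to limit the dynamic range of nonspeech frames is not stated in the abstract.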

___

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK