Bandwidth extension of narrowband speech in log spectra domain using neural network

Despite significant recent advances in communication technology, speech signals still suffer from low perceived quality caused by the bandwidth limitations of telephone networks. Bandwidth extension (BWE) adds the missing high-frequency components to band-limited telephone speech and improves speech perception significantly. In this work, we develop a new method for representing the vocal tract filter coefficients using log filter bank energy (LFBE) parameters as an alternative to mel-frequency cepstral coefficients (MFCCs). The approach relies on the strong correlation between the spectral components of the low-band and high-band spectra. Furthermore, we evaluate the performance of a Gaussian mixture model (GMM) and a multilayer perceptron neural network in estimating the high-frequency envelope. Objective evaluations indicate that the LFBE feature vectors perform better than the MFCCs. They also show that a neural network, which is not commonly used in BWE, performs better than the GMM.
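To make the relationship between the two feature types concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual definition of MFCCs as a discrete cosine transform (DCT) of log filter bank energies, so LFBEs are simply the values before that transform. The mel spacing, the number of filters, the 300–3400 Hz band edges, the 8 kHz sampling rate, and all function names are illustrative assumptions that do not come from the paper.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_bins, sample_rate, f_lo, f_hi):
    """Triangular filters equally spaced on the mel scale (an assumption),
    defined over n_bins power-spectrum bins covering 0 .. sample_rate / 2."""
    mel_lo, mel_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    # n_filters + 2 edge frequencies in Hz
    edges = [mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= centre:
                row.append((f - left) / (centre - left))   # rising slope
            elif centre < f < right:
                row.append((right - f) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def lfbe(power_spectrum, bank, floor=1e-10):
    """Log of the energy collected by each filter (floored to avoid log(0))."""
    return [math.log(max(sum(w * p for w, p in zip(row, power_spectrum)), floor))
            for row in bank]

def mfcc_from_lfbe(log_energies, n_coeffs):
    """Unnormalised DCT-II of the log energies, truncated to n_coeffs terms."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (i + 0.5) / n)
                for i, e in enumerate(log_energies))
            for k in range(n_coeffs)]

# Hypothetical narrowband frame: flat power spectrum with 129 bins at 8 kHz.
bank = mel_filter_bank(10, 129, 8000, 300.0, 3400.0)
frame_lfbe = lfbe([1.0] * 129, bank)
frame_mfcc = mfcc_from_lfbe(frame_lfbe, 8)
```

Each LFBE value describes one frequency band directly, whereas each MFCC mixes all bands through the cosine weights; truncating the DCT (8 of 10 coefficients here) also discards some spectral detail.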

___

  • LSD measurements. As shown in Figures 6 and 7, when BWE is implemented with a neural network, the best result is obtained when the input is normalized and the output is not normalized.
  • For the best result obtained with LFBE feature vectors, the neural network improved on the GMM by 2.87 dB in the LSD measurement and by 0.33 dB in the IS assessment. For MFCC feature vectors, the corresponding improvements are 2.47 dB and 0.29 dB.
  • Bandwidth extension improves the quality of narrowband speech signals by restoring the lost information. In this research, two methods, a GMM and a neural network, were used to estimate the high-band spectral envelope from narrowband information.
  • Furthermore, two types of feature vectors, MFCC and LFBE, were used.
  • MFCC parameters were widely used in earlier BWE approaches, but this research is the first to use LFBEs for artificial bandwidth extension with a neural network. Objective evaluations of both the GMM and the neural network implementations indicate the superiority of LFBEs over MFCCs.
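The LSD figures quoted above are reported in dB. As an illustration only, the sketch below uses a commonly used definition of the measure: the root-mean-square difference, over frequency bins, between the reference and estimated power spectra expressed in dB, averaged over frames. The page does not give the exact formula used in the paper, so the definition, the function names, and the flooring constant are all assumptions.

```python
import math

def lsd_frame(ref_power, est_power, floor=1e-12):
    """RMS difference in dB between two power spectra of equal length.
    The floor (an assumption) guards against log10(0)."""
    diffs = [10.0 * math.log10(max(r, floor)) - 10.0 * math.log10(max(e, floor))
             for r, e in zip(ref_power, est_power)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def lsd_mean(ref_frames, est_frames):
    """Average the per-frame distance over all frames of an utterance."""
    values = [lsd_frame(r, e) for r, e in zip(ref_frames, est_frames)]
    return sum(values) / len(values)

# Hypothetical high-band spectra: the estimate is 10 times too strong in
# every bin, which corresponds to a uniform 10 dB error.
reference = [[1.0, 2.0, 4.0], [0.5, 0.25, 1.0]]
estimate = [[10.0 * p for p in frame] for frame in reference]
distance_db = lsd_mean(reference, estimate)
```

Under this definition a lower value means the estimated high-band envelope is closer to the true wideband envelope, so an improvement of 2.87 dB corresponds to a reduction of the distance by that amount.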
Turkish Journal of Electrical Engineering and Computer Science
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK