PARABOLIC FILTER MEL FREQUENCY CEPSTRAL COEFFICIENT AND FUSION OF FEATURES FOR SPEAKER AGE CLASSIFICATION

Speech is an acoustic signal initiated at the inner end of the human vocal tract and radiated as an audio wave from its outer end. The structure and length of the vocal tract cause features extracted from utterances with the same content, but spoken by different speakers, to differ. As a person grows, the vocal tract changes in length, which in turn gradually modifies speech characteristics. The mel frequency cepstral coefficient (MFCC), which uses a bank of triangular band-pass filters, is widely regarded as the most popular feature in speech processing applications. To improve the accuracy of speaker age classification, a new spectral feature set named the parabolic filter mel frequency cepstral coefficient (PFMFCC) is proposed in this study. PFMFCC uses parabolic band-pass filters instead of triangular ones. The extraction technique uses a bank of 30 parabolic band-pass filters to extract 42 features from each 20 ms speech frame. These features are applied to three classical classifiers: the Gaussian mixture model (GMM), cosine score, and probabilistic linear discriminant analysis (PLDA). The aGender database, consisting of 47 hours of German speech uttered by a total of 852 speakers, is used in this study. On the female dataset, the new PFMFCC feature achieved accuracies of 51.01%, 56.01% and 58.14% with the cosine score, GMM and PLDA classifiers, respectively. On the male dataset, it achieved accuracies of 50.44%, 52.74% and 57.23% with the cosine score, GMM and PLDA classifiers, respectively. Using a fusion of seven feature sets, overall accuracies of 60.18%, 52.17% and 56.35% are obtained with the cosine score, GMM and PLDA classifiers, respectively, across all seven speaker age classes. With cosine scoring, the feature fusion improved the overall accuracy by 2.55% compared to a previous speaker age classification study carried out on the same database.
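The difference between a triangular and a parabolic filter bank can be illustrated with a minimal sketch. The abstract does not give the exact parabola definition, so the shape used here is an assumption: each filter is an inverted parabola on the mel axis that peaks at its center and falls to zero at the two neighbouring centers. The 8 kHz sampling rate, the 512-point FFT, the number of cepstra and all function names are likewise hypothetical choices for illustration, not values taken from the study.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common mel-scale formula (O'Shaughnessy variant).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def _normalized_offsets(n_filters, n_fft, sample_rate):
    # Offset of every FFT bin from every filter center, measured on the
    # mel axis and scaled so that the neighbouring centers sit at +/-1.
    bin_mels = hz_to_mel(np.fft.rfftfreq(n_fft, d=1.0 / sample_rate))
    edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    centers = edges[1:-1]
    half_width = edges[1] - edges[0]
    return (bin_mels[None, :] - centers[:, None]) / half_width

def triangular_filter_bank(n_filters=30, n_fft=512, sample_rate=8000):
    # Conventional MFCC shape: weight 1 - |x|, clipped at zero.
    x = _normalized_offsets(n_filters, n_fft, sample_rate)
    return np.clip(1.0 - np.abs(x), 0.0, None)

def parabolic_filter_bank(n_filters=30, n_fft=512, sample_rate=8000):
    # ASSUMED parabolic shape: weight 1 - x^2, clipped at zero.
    x = _normalized_offsets(n_filters, n_fft, sample_rate)
    return np.clip(1.0 - x ** 2, 0.0, None)

def cepstra_from_frame(frame, bank, n_fft=512, n_ceps=13):
    # Window the frame, take the power spectrum, apply the filter bank,
    # then log-compress and decorrelate with a DCT-II.
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    log_energy = np.log(bank @ power + 1e-10)
    n = bank.shape[0]
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n)[None, :] + 0.5) / n)
    return basis @ log_energy

# Usage: one 20 ms frame at the assumed 8 kHz rate is 160 samples.
rng = np.random.default_rng(0)
frame = rng.standard_normal(160)
pf_bank = parabolic_filter_bank()
tri_bank = triangular_filter_bank()
ceps = cepstra_from_frame(frame, pf_bank)
```

Under this assumed shape, the parabolic filter weights every bin inside its passband at least as heavily as the triangular filter does, so each filter gives more weight to spectral energy away from its center frequency. How the study arrives at 42 features per frame is not stated in the abstract and is not reproduced here.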
