Ergün YÜCESOY

MFKK Özniteliklerine Eklenen Logaritmik Enerji ve Delta Parametrelerinin Yaş ve Cinsiyet Sınıflandırma Üzerindeki Etkileri

Effect of Inclusion of Delta Derivatives and Log Energy to MFCC Features on Age and Gender Classification

Automatically determining the age and gender group of a speaker is an important research topic with applications in many fields, most notably call centers. This study investigates how adding logarithmic energy and delta parameters to Mel Frequency Cepstral Coefficients (MFCC) affects automatic age and gender recognition. MFCC features extracted from the speech signals were converted into Gaussian Mixture Model (GMM) supervectors and fed to a Support Vector Machine (SVM); after an optimization step, each speaker was assigned to an age and gender group. In addition to the parameters appended to the MFCCs, the effects of the number of MFCCs and the number of GMM components on performance were examined: the number of MFCCs was varied between 8 and 20, and the number of GMM components between 32 and 256. In tests on 1388 utterances from 299 speakers in the development set of the aGender database, the highest classification rate, 60.23%, was obtained by adding logarithmic energy, delta and delta-delta parameters to 12 cepstral coefficients. The optimum number of GMM components was found to be 128, and the effects of the logarithmic energy, delta and delta-delta parameters on performance were 1.17%, 3.24% and 4.61%, respectively.

Keywords:

Age and gender classification, Speech processing, Support vector machines, Gaussian mixture model
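
The feature augmentation described in the abstract can be illustrated with a minimal sketch. It assumes the common regression-based delta formula with a window of two frames and edge padding by repeating the first and last frame; the abstract does not state which delta formula or window the study used, so these are assumptions. The function names `log_energy`, `delta` and `augment` are hypothetical, and the input values are made up for illustration only.

```python
import math

def log_energy(frame):
    """Natural log of the sum of squared samples in one frame."""
    energy = sum(s * s for s in frame)
    return math.log(max(energy, 1e-10))  # floor avoids log(0) on silent frames

def delta(features, window=2):
    """Regression-based delta over a list of per-frame feature vectors.

    Assumed formula: d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2),
    with edges padded by repeating the first and last frame.
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(features[t])):
            acc = 0.0
            for n in range(1, window + 1):
                nxt = features[min(t + n, num_frames - 1)][k]
                prv = features[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def augment(cepstra, frames, window=2):
    """Append log energy, then delta and delta-delta, to each frame's cepstra."""
    static = [c + [log_energy(f)] for c, f in zip(cepstra, frames)]
    d1 = delta(static, window)   # first-order temporal derivative
    d2 = delta(d1, window)       # second-order (delta-delta)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Hypothetical input: 5 frames with 12 cepstral coefficients each,
# plus the raw samples of each frame (made-up values).
cepstra = [[float(t + k) for k in range(12)] for t in range(5)]
frames = [[0.1 * (t + 1)] * 4 for t in range(5)]
feats = augment(cepstra, frames)
print(len(feats), len(feats[0]))  # 5 39
```

With 12 cepstral coefficients, adding log energy gives 13 static values per frame, and appending delta and delta-delta triples that to 39, which corresponds to the configuration the abstract reports as the best performing.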

