Speech emotion recognition using semi-NMF feature optimization

Speech emotion recognition (SER) has attracted considerable research attention in recent years. Many SER systems combine different speech features to improve performance, but the resulting feature set becomes large and the classifier correspondingly complex to train. Additionally, some of the features may be irrelevant to emotion detection, which reduces recognition accuracy. To overcome this drawback, feature optimization can be applied to the feature sets to obtain the most discriminative emotional features before classification. In this paper, semi-nonnegative matrix factorization (semi-NMF) with singular value decomposition (SVD) initialization is used to optimize the speech features. The features considered in this work are mel-frequency cepstral coefficients, linear prediction cepstral coefficients, and Teager energy operator-autocorrelation (TEO-AutoCorr). Emotions are classified with k-nearest neighbor and support vector machine (SVM) classifiers under a 5-fold cross-validation scheme. The datasets used for the performance analysis are EMO-DB and IEMOCAP. The performance of the proposed SER system using semi-NMF is validated in terms of classification accuracy. The results show that the accuracy of the proposed SER system improves markedly when the semi-NMF algorithm is used to optimize the feature sets, compared with a baseline SER system without optimization.
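The core idea, factorizing a mixed-sign feature matrix X as F Gᵀ with only G constrained to be nonnegative, can be sketched with the standard semi-NMF updates (alternating least squares for F, a multiplicative rule for G) seeded from an SVD. This is a minimal NumPy illustration under those assumptions; the function name, the toy data, and the rank are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def semi_nmf(X, rank, n_iter=200, eps=1e-9):
    """Factorize X (d x n, any sign) as F @ G.T with G >= 0 (semi-NMF sketch)."""
    # SVD initialization: seed G from the top right singular vectors,
    # clipped to stay strictly positive for the multiplicative update.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    G = np.maximum(Vt[:rank].T, eps)          # n x rank
    pos = lambda A: (np.abs(A) + A) / 2       # elementwise positive part
    neg = lambda A: (np.abs(A) - A) / 2       # elementwise negative part
    for _ in range(n_iter):
        # F update: unconstrained least-squares solution given G.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        # G update: multiplicative rule that preserves nonnegativity.
        XtF = X.T @ F
        FtF = F.T @ F
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF)
        G *= np.sqrt(num / np.maximum(den, eps))
    return F, G

# Toy usage: compress a 40-dim frame-level feature matrix (100 frames)
# to 10 latent features; G then serves as the optimized feature set.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 100))
F, G = semi_nmf(X, rank=10)
err = np.linalg.norm(X - F @ G.T) / np.linalg.norm(X)
```

In a full SER pipeline the rows of G (one per frame or utterance) would replace the raw MFCC/LPCC/TEO-AutoCorr vectors as input to the SVM or k-NN classifier under 5-fold cross-validation.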

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK