Model Kompanzasyonlu Birinci Derece İstatistikleri ile i-vektörlerin Gürbüzlüğünün Artırılması

Konuşmacı tanıma sistemleri özellikle i-vektörlerin performansı sebebiyle son on yılda önemli gelişmeler elde etmiştir. Bu gelişmelere rağmen eğitim ve test verileri arasındaki uyumsuzluk tanıma performansını önemli ölçüde etkilemektedir. Bu çalışmada, model kompanzasyon yöntemleri i-vektör çıkarımı şemasına eklenerek toplanabilir gürültülere karşı gürbüzlüğü artıracak bir çözüm sunulmaktadır. Durağan gürültüler için model kompanzasyon teknikleri oldukça gürbüz sistemler üretir. Paralel Model Kompanzasyonu ve Vektör Taylor Serileri en gelişmiş model kompanzasyon tekniklerinden kabul edilmektedir. Bu metotlar birinci dereceden istatistiklere uygulanarak toplanabilir gürültülerden kaynaklanan uyumsuzluğu azaltacak gürültülü tüm değişkenlik uzayı eğitimi amaçlanmıştır. Tüm değişkenlik matrisinin eğitimi, i-vektör boyutunun azaltılması, i-vektörlerin puanlanması gibi geleneksel i-vektör şemasının diğer tüm parçaları değişmeden kalmaktadır. Önerilen yöntem, 6 dB’lik adımlarla -6 dB’den 18 dB’ye kadar çeşitli sinyal-gürültü oranlarına (SNR) sahip dört farklı gürültü tipi ile test edilmiştir. Her iki yöntemle de en düşük SNR seviyelerinde bile eşit hata oranlarında yüksek azalmalar elde edilmiştir. Önerilen yaklaşım eşit hata oranında ortalama olarak %50’den fazla göreceli azalma sağlamıştır.

Anahtar Kelimeler:

Paralel model kompanzasyonu, Gürbüz konuşmacı tanıma, Vektör Taylor serileri, I-vektör

Increasing the Robustness of i-vectors with Model Compensated First Order Statistics

Speaker recognition systems have improved significantly over the last decade, largely owing to the performance of i-vectors. Despite these advances, mismatch between training and test data still degrades recognition performance considerably. This paper proposes a solution that increases robustness against additive noise by inserting model compensation techniques into the i-vector extraction scheme. For stationary noise, model compensation techniques produce highly robust systems. Parallel Model Compensation and Vector Taylor Series are considered state-of-the-art model compensation techniques. These methods are applied to the first-order statistics with the aim of training a noisy total variability space, which reduces the mismatch caused by additive noise. All other parts of the conventional i-vector scheme, such as total variability matrix training, i-vector dimensionality reduction, and i-vector scoring, remain unchanged. The proposed method was tested on four noise types at signal-to-noise ratios (SNR) from -6 dB to 18 dB in 6 dB steps. Both methods achieved large reductions in equal error rate, even at the lowest SNR levels. On average, the proposed approach yielded a relative reduction in equal error rate of more than 50%.
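As a rough illustration of the quantities named in the abstract, the following Python sketch lists the tested SNR levels, combines a clean and a noise mean vector with the log-add approximation commonly associated with parallel model combination, and computes a relative reduction in equal error rate. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the `gain` parameter, and all numeric inputs are hypothetical. The abstract does not specify how compensation is applied to the first-order statistics, so the log-add step is shown only for a single mean vector in the log-spectral domain.

```python
import math


def snr_grid(start_db=-6, stop_db=18, step_db=6):
    """Return the SNR levels from start_db to stop_db (inclusive) in step_db steps."""
    return list(range(start_db, stop_db + 1, step_db))


def log_add_mean(clean_log_mean, noise_log_mean, gain=1.0):
    """Combine clean-speech and noise means in the log-spectral domain.

    Log-add approximation: additive noise adds in the linear spectral
    domain, so each component becomes log(exp(x) + gain * exp(n)).
    Variances are ignored in this simplified form. `gain` is a
    hypothetical noise scaling factor, not a parameter from the paper.
    """
    return [
        math.log(math.exp(x) + gain * math.exp(n))
        for x, n in zip(clean_log_mean, noise_log_mean)
    ]


def relative_reduction(baseline_eer, compensated_eer):
    """Relative reduction in equal error rate, in percent."""
    return 100.0 * (baseline_eer - compensated_eer) / baseline_eer


if __name__ == "__main__":
    print(snr_grid())
    # Hypothetical log-spectral means, chosen only to exercise the function.
    print(log_add_mean([2.0, 1.0, 0.5], [0.0, 0.5, 1.0]))
    # Hypothetical EER values, not results reported in the paper.
    print(relative_reduction(20.0, 8.0))
```

When the noise mean lies far below the clean mean, the combined mean stays close to the clean one, and as the noise energy grows it dominates. This is the mismatch that model compensation is meant to account for.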

Keywords:

Parallel model compensation, Robust speaker recognition, Vector Taylor series, I-vector

