Unsupervised deep feature embeddings for speaker diarization


Speaker diarization aims to determine "who spoke when?" in multispeaker recordings. In this paper, we propose to learn a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. These embeddings are learned by a deep autoencoder trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames. The learned embeddings are then used in Gaussian mixture model (GMM) based hierarchical clustering for diarization. The results show that these unsupervised embeddings outperform MFCCs in reducing the diarization error rate. Experiments conducted on a popular subset of the AMI meeting corpus, consisting of 5.4 h of recordings, show that the new embeddings reduce the average diarization error rate by 2.96%; for individual recordings, the maximum improvement is 8.05%.
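
To make the pipeline concrete, the following is a minimal sketch in Python, assuming a 13-dimensional MFCC front end, a small fully connected autoencoder, and scikit-learn's GaussianMixture as a simplified stand-in for the GMM-based hierarchical clustering stage. The file name, layer sizes, training settings, and cluster count are illustrative assumptions, not the paper's exact configuration.

```python
import librosa
import tensorflow as tf
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# 1. Frame-level MFCCs from a multispeaker recording
#    ("meeting.wav" is a placeholder file name).
audio, sr = librosa.load("meeting.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T   # shape: (frames, 13)
mfcc = StandardScaler().fit_transform(mfcc)                # zero-mean, unit-variance

# 2. Unsupervised deep autoencoder; the bottleneck activations
#    serve as the learned feature embeddings.
inp = tf.keras.Input(shape=(13,))
h = tf.keras.layers.Dense(64, activation="relu")(inp)
code = tf.keras.layers.Dense(16, activation="relu", name="embedding")(h)
h = tf.keras.layers.Dense(64, activation="relu")(code)
out = tf.keras.layers.Dense(13, activation="linear")(h)
autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(mfcc, mfcc, epochs=20, batch_size=256, verbose=0)  # reconstruct input

# 3. Map every frame to its learned embedding.
encoder = tf.keras.Model(inp, code)
embeddings = encoder.predict(mfcc, verbose=0)

# 4. Cluster embeddings with a GMM; per-frame labels give the diarization
#    hypothesis. A single fit with a fixed component count stands in for
#    the hierarchical (merge-based) clustering stage described above.
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(embeddings)
```

In a full hierarchical system, the clustering stage would start from an overestimated number of clusters and iteratively merge GMM-modeled clusters under a stopping criterion, rather than fitting a single model with a fixed component count as in this sketch.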
