The Effect of Silence Removal and Speech Segmentation on Turkish Automatic Speech Recognition

Automatic Speech Recognition (ASR) systems are built primarily on acoustic information. Paired speech and text data are used to derive phoneme information from the acoustic signal. Acoustic models trained on such data cannot capture all of the acoustic variation encountered in real life, so certain pre-processing steps are required to remove acoustic information that degrades ASR performance. In this study, a method is proposed for removing the silent intervals that occur within speech. The aim of the proposed method is to eliminate silence and to split utterances that introduce long dependencies in the acoustic signal into shorter segments. The silence-free, segmented speech produced by the method is given as input to a Turkish ASR system. At the system output, the transcripts corresponding to the input speech segments are concatenated and presented as a single text. The experiments carried out show that silence removal and speech segmentation improve the performance of ASR systems.

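To make the described pipeline concrete, below is a minimal sketch of the kind of energy-based silence removal and segmentation the abstract outlines. It assumes mono PCM audio already loaded as a NumPy array; the frame length, energy threshold, and minimum pause duration (frame_ms, energy_ratio, min_pause_ms) are illustrative assumptions, not the parameters of the paper's method.

```python
import numpy as np

def segment_speech(samples, sample_rate, frame_ms=25, energy_ratio=0.1, min_pause_ms=300):
    """Split a mono signal into speech segments, discarding silent frames.

    All thresholds here are illustrative assumptions, not the paper's values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-time energy per frame; a frame counts as speech when its energy
    # exceeds a fixed fraction of the mean frame energy.
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1)
    is_speech = energy > energy_ratio * energy.mean()

    min_pause_frames = int(min_pause_ms / frame_ms)
    segments, current, pause = [], [], 0
    for frame, speech in zip(frames, is_speech):
        if speech:
            current.append(frame)  # keep speech frames, drop silent ones
            pause = 0
        else:
            pause += 1
            # A pause longer than min_pause_ms closes the current segment.
            if pause >= min_pause_frames and current:
                segments.append(np.concatenate(current))
                current = []
    if current:
        segments.append(np.concatenate(current))
    return segments
```

Each returned segment would then be decoded independently by the recognizer, and the hypotheses joined in order to form the final transcript, mirroring the pipeline described in the abstract.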
