Necaattin BARIŞÇI, Nursal ARICI, Recep Sinan ARSLAN, Sabri KOÇER

Detecting and correcting automatic speech recognition errors with a new model

The purpose of automatic speech recognition (ASR) systems is to recognize human speech signals and convert them into text that a computer can process. Although many ASR applications are versatile and widely used in the real world, their output is still relatively inaccurate. They tend to produce spelling errors in recognized words, especially in noisy environments, when the vocabulary size increases, and when the input speech is of poor quality. The persistence of errors in ASR systems has created a need for alternative methods that detect and correct such errors automatically. In this study, the basic principles of ASR evaluation are first summarized, and then a new approach based on suggesting an alternative hypothesis is proposed for detecting and correcting the errors generated by ASR systems. The proposed method involves a series of steps: identifying incorrect words, selecting those that can be corrected, and identifying candidate words to replace them. Tests carried out in different test environments showed significant performance improvements for Turkish, averaging 4.60%.
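The abstract does not state which metric underlies the reported improvement. As an illustration of the kind of ASR evaluation it refers to, the following is a minimal sketch of word error rate (WER), a commonly used measure, computed as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the example sentences are assumptions made for illustration and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only; not the evaluation code used in the study.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one misspelled word out of four.
    print(word_error_rate("bu bir deneme cümlesidir", "bu bir denem cümlesidir"))
```

Under this measure, a correction step helps whenever it turns a misrecognized word back into the reference word, because each such fix removes one substitution from the edit distance.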
