Performance Evaluation of BiRNN, BiLSTM and BiGRU Models Applied to Speech Recognition

Speech recognition is the process of converting sound waves into text. In this study, the speech recognition performance of Bidirectional Recurrent Neural Network (BiRNN), Bidirectional Long Short-Term Memory (BiLSTM), and Bidirectional Gated Recurrent Unit (BiGRU) models was examined and compared on an audiobook dataset. The models employ Connectionist Temporal Classification (CTC) and Convolutional Neural Networks (CNN). These models were also compared with their unidirectional versions. The study found that BiLSTM achieved the highest speech recognition success rate. However, the BiGRU model, which uses 33% fewer parameters at the cost of a recognition rate only 3% lower, is also noteworthy. Bidirectional models were found to give more successful results than unidirectional models.
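As a rough illustration of the CTC decoding step that models of this kind rely on, a minimal greedy (best-path) decoder can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation; the blank-label index and the toy per-frame probabilities are assumptions.

```python
# Minimal greedy (best-path) CTC decoder: per-frame argmax,
# collapse consecutive repeats, then drop the blank symbol.
# Illustrative sketch only, not the paper's implementation.

BLANK = 0  # index of the CTC blank label (assumption)

def ctc_greedy_decode(frame_probs):
    """frame_probs: list of per-frame probability lists over labels."""
    # 1) Pick the most likely label at each time step.
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2) Collapse runs of identical consecutive labels.
    collapsed = [lab for i, lab in enumerate(path)
                 if i == 0 or lab != path[i - 1]]
    # 3) Remove blanks to obtain the final label sequence.
    return [lab for lab in collapsed if lab != BLANK]

# Example: 6 frames over labels {0: blank, 1: 'a', 2: 'b'}
probs = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.7, 0.2],    # 'a' (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # 'a' (kept: separated by a blank)
    [0.2, 0.1, 0.7],    # 'b'
    [0.2, 0.1, 0.7],    # 'b' (repeat, collapsed)
]
print(ctc_greedy_decode(probs))  # -> [1, 1, 2]
```

Greedy decoding is the simplest way to read out a CTC-trained network; beam search over the same per-frame distributions generally yields lower error rates at higher cost.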

Avrupa Bilim ve Teknoloji Dergisi
  • Publication frequency: 4 issues per year
  • Founded: 2013
  • Publisher: Osman Sağdıç