LSTM Hiperparametrelerinin Ses Tanıma Performansına olan Etkilerinin Araştırılması

Bilgisayara dayalı hesaplamalı metotlar ve donanım teknolojilerindeki gelişmelerle birlikte, bilgisayarlar ses tanıma ve görüntü işleme gibi zor görevlerin üstesinden gelme konusunda daha güçlü hale gelmiştir. Ses tanıma, hesaplamalı veya analitik yöntemler kullanarak ses sinyallerinin metinsel karşılığını çıkarma görevidir. Ses tanıma aksanlar ve diller arasındaki değişkenlikler, güçlü donanım gereksinimleri, doğru modellerin üretilebilmesi için büyük veri setlerine olan ihtiyaç ve ses kalitesini etkileyen çevresel faktörlerden dolayı zor bir problemdir. Son yıllarda, Grafiksel İşleme Birimleri gibi donanım cihazlarının yükselen veri işleme yetenekleri yardımıyla derin öğrenme metotları, özellikle Özyinelemeli Sinir Ağları (ÖSA – Recurrent Neural Networks, RNN) ve RNN’in bir varyantı olan LSTM (Long Short Term Memory – Uzun Kısa Dönem Hafıza), ses tanıma alanında çok yaygın ve kabul gören metotlar haline gelmişlerdir. Literatürde, RNN ve LSTM ses tanıma ve ses tanımanın uygulamaları için katman sayısı, gizli katman sayısı ve yığın boyutu gibi çeşitli parametrelerle kullanılmaktadır. Kullanılan bu parametre değerlerin hangi kriterlere göre seçildiği ve bu parametre değerlerinin daha sonraki çalışmalarda da kullanılabilirliği ise incelenmemiştir. Bu çalışmada, LSTM hiperparametrelerinin ses tanıma performansına olan etkileri hata oranları ve derin mimari maliyeti dikkate alınarak incelenmiştir. Her bir parametre ayrı olarak değerlendirilmiş ve bu esnada diğer parametreler sabit tutulmuş ve parametrelerin ses verisi üzerindeki etkisi gözlemlenmiştir. Deneysel sonuçlarda, daha düşük hata oranları ve daha iyi ses tanıma performansı elde edebilmek için her parametrenin seçilen eğitim seti için farklı değerlere sahip olduğu görülmüştür. Bu çalışmanın sonuçlarına göre, LSTM için en uygun parametrelerin seçilmesinden önce ses veri kümesi üzerinde farklı deneyler yapılarak her bir parametre için en uygun değerin bulunması gerektiği gözlemlenmiştir.

Investigation of the Effect of LSTM Hyperparameters on Speech Recognition Performance

With recent advances in hardware technologies and computational methods, computers have become more capable of handling difficult tasks such as speech recognition and image processing. Speech recognition is the task of extracting the text representation of a speech signal using computational or analytical methods. It is a challenging problem due to variations in accents and languages, demanding hardware requirements, the need for large datasets to generate accurate models, and environmental factors that affect signal quality. Recently, with the increasing processing capability of hardware devices such as Graphics Processing Units, deep learning methods have become prevalent, state-of-the-art methods for speech recognition, especially Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which are a variant of RNNs. In the literature, RNNs and LSTMs are used for speech recognition and its applications with various parameters, e.g., number of layers, number of hidden units, and batch size. How these parameter values were selected, and whether they can be reused in future studies, has not been investigated. In this study, we investigated the effect of LSTM hyperparameters on speech recognition performance in terms of error rates and deep architecture cost. Each parameter is investigated separately while the other parameters are held constant, and the effect of each parameter is observed on a speech corpus. Experimental results show that, for the selected number of training instances, each parameter has specific values that yield lower error rates and better speech recognition performance. The study shows that, before selecting values for the LSTM parameters, several experiments should be performed on the speech corpus to find the most suitable value for each parameter.
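The one-parameter-at-a-time protocol described in the abstract can be sketched as follows. This is a minimal illustration only: the parameter names, the baseline, and the candidate values are hypothetical placeholders, since the abstract does not list the settings actually used, and model training itself is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical baseline and candidate values (assumptions, not taken from the study).
BASELINE: Dict[str, int] = {"num_layers": 2, "hidden_units": 128, "batch_size": 32}
CANDIDATES: Dict[str, List[int]] = {
    "num_layers": [1, 2, 3],
    "hidden_units": [64, 128, 256],
    "batch_size": [16, 32, 64],
}


def one_at_a_time_configs(
    baseline: Dict[str, int], candidates: Dict[str, List[int]]
) -> List[Tuple[str, Dict[str, int]]]:
    """Build configurations that vary one parameter while the others stay at baseline."""
    configs = []
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value     # change only the parameter under study
            configs.append((name, config))
    return configs


configs = one_at_a_time_configs(BASELINE, CANDIDATES)
for varied, config in configs:
    print(varied, config)
```

Each generated configuration would then be used to train an LSTM on the speech corpus and record its error rate, so that the effect of the varied parameter can be compared with the others held fixed.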
