Ayşenur Yılmaz, Yaşar Said Derdiman, Turgay Koç

Ses Telleri Görüntülerinde Otomatik Piksel Tabanlı Sınıflandırma için Performans Ölçütlerinin İncelenmesi

Son yıllarda yapılan konuşma sistemi ile ilgili sorunların tespit edilmesinde ve konuşma analizinde gelişen teknolojinin getirdiği imkanlar sayesinde ses tellerinin yüksek hızlı görüntüleri yaygın olarak kullanılmaya başlanmıştır. Bu yüksek hızlı görüntüler konuşmacının ses tellerinin vibrasyonuyla ilgili detaylı bilgiler içerir. Fakat verinin büyüklüğü göz önüne alındığında bu görüntülerin manuel olarak işlenmesi mümkün görünmemektedir. Bu nedenle son yıllarda geliştirilen otomatik görüntü işleme algoritmaları ile ses telleri görüntülerinden glottis tespiti ve bölütlenmesi popüler hale gelmiştir. Bu çalışmada literatürdeki çalışmalardan farklı olarak ses telleri görüntülerinin piksel tabanlı otomatik sınıflandırılabilmesi için kullanılabilecek olan doğruluk, keskinlik (hassasiyet), geri çağırma, F1 skoru ve eşit hata oranı performans ölçütleri incelenmiştir. Bununla birlikte literatürdeki piksel tabanlı sınıflandırma modeli olan derin yapay sinir ağı temel sistem olarak alınarak yeni önerilen Gauss Karışım Modeli tabanlı sistem ile kıyaslanmıştır. Boyutları 256x256 olan manuel olarak bölütlenmiş 3000 adet yüksek hızlı endoskopik kamera görüntüsü rasgele olarak eğitim, geliştirme ve değerlendirme veri setlerini oluşturmak için kullanılmıştır. Veri seti ile eğitilen modellerin, geliştirme ve değerlendirme setleri ile yapılan çalışmalar sonucunda ikili sınıflandırmada yaygın olarak kullanılan doğruluk, keskinlik, geri çağırma ve F1 skoru ölçütlerinin modelden modele yaklaşık sadece %1 oranında değiştiği ve bu sonuçların sistem performansını yansıtma konusunda, aynı durumda % 22 değişim gösterebilen eşit hata oranı kadar etkili olmadığını göstermiştir. Bu çalışmanın sonucunda sistemlerin doğruluk değerleri aynı kalsa bile eşit hata oranı farkları değişebilmekte, bu nedenle aşırı uydurulmuş sistemlerin daha doğru kestirilebildiği gösterilmektedir. 
Temel sistem ile önerilen modeller karşılaştırıldığında, önerilen sistem 4096 karışımlı Gauss Karışım Modeli, kullanılan bütün performans ölçütleri için en iyi sonucu vermiş olup, değerlendirme setindeki eşit hata oranı için %22’lik bir performans iyileştirmesi göstermiştir.

Analysis of Performance Metrics for Automatic Pixel-Based Classification in Vocal Cord Images

In recent years, advances in technology have made high-speed imaging of the vocal cords widely used for detecting speech system disorders and for speech analysis. These high-speed images contain detailed information about the vibration of the speaker's vocal cords. However, given the size of the image data, processing these images manually is not feasible. For this reason, glottis detection and segmentation from vocal cord images using automatic image processing algorithms has become popular in recent years. Unlike previous studies in the literature, this study examines the performance metrics that can be used for automatic pixel-based classification of vocal cord images: accuracy, precision, recall, F1 score, and equal error rate. In addition, a deep artificial neural network, the pixel-based classification model from the literature, is taken as the baseline system and compared with a newly proposed Gaussian Mixture Model based system. A total of 3000 manually segmented high-speed endoscopic camera images, each 256x256 pixels, were randomly divided into training, development, and evaluation sets. Experiments on the development and evaluation sets with models trained on the training set show that accuracy, precision, recall, and F1 score, which are commonly used in binary classification, vary by only about 1% from model to model, and that they reflect system performance less effectively than the equal error rate, which can vary by as much as 22% under the same conditions. The study shows that the equal error rate can differ between systems even when their accuracy values remain the same, so overfitted systems can be identified more reliably.
When the proposed models are compared with the baseline system, the proposed 4096-component Gaussian Mixture Model gives the best result on every performance metric and achieves a 22% improvement in equal error rate on the evaluation set.
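
The metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes NumPy, binary ground-truth masks (1 = glottis pixel), and per-pixel scores in which higher means "more likely glottis". The function names `binary_metrics` and `equal_error_rate` and the 0.5 decision threshold in the usage lines are hypothetical choices made for illustration. The equal error rate is approximated as the operating point where the false acceptance rate and false rejection rate are closest.

```python
import numpy as np

def binary_metrics(labels, predictions):
    """Accuracy, precision, recall and F1 for 0/1 pixel labels and predictions."""
    labels = np.asarray(labels, dtype=bool).ravel()
    predictions = np.asarray(predictions, dtype=bool).ravel()
    tp = int(np.sum(predictions & labels))
    tn = int(np.sum(~predictions & ~labels))
    fp = int(np.sum(predictions & ~labels))
    fn = int(np.sum(~predictions & labels))
    accuracy = (tp + tn) / labels.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def equal_error_rate(labels, scores):
    """Approximate EER; both classes must be present in `labels`."""
    labels = np.asarray(labels, dtype=bool).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("EER needs both positive and negative pixels")
    # Sort pixels by descending score; accepting the top-k pixels
    # corresponds to one candidate threshold.
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    sorted_labels = labels[order]
    tp_cum = np.cumsum(sorted_labels)
    fp_cum = np.cumsum(~sorted_labels)
    # Keep only the last pixel of each run of tied scores, so that every
    # retained point corresponds to a real threshold.
    last_of_run = np.r_[sorted_scores[1:] != sorted_scores[:-1], True]
    # Prepend the "accept nothing" operating point (FAR = 0, FRR = 1).
    far = np.r_[0.0, fp_cum[last_of_run] / n_neg]
    frr = np.r_[1.0, 1.0 - tp_cum[last_of_run] / n_pos]
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)

# Toy example: 8 pixels, 3 of them glottis.
toy_labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])
toy_scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05])
toy_metrics = binary_metrics(toy_labels, toy_scores >= 0.5)
toy_eer = equal_error_rate(toy_labels, toy_scores)
```

Because glottis pixels usually make up a small fraction of a frame, threshold-based metrics such as accuracy are dominated by the background class, whereas the equal error rate is computed from the full score distribution of both classes. This is one plausible reading of why the abstract finds it more sensitive to differences between models.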
