ResNet-based Deep Gated Recurrent Unit for Image Captioning on Smartphones

Image captioning aims to generate grammatically and semantically acceptable natural language sentences for visual content. Gated recurrent unit (GRU) based approaches have recently attracted much attention due to their performance in caption generation. The main challenges with GRUs are the vanishing gradient problem and modulating the flow of the most relevant information in deep networks. In this paper, we propose a ResNet-based deep GRU approach that overcomes the vanishing gradient problem with residual connections, while multiple GRU layers ensure that the most relevant information flows through the network. Residual connections are employed between consecutive layers of the deep GRU, which improves gradient flow from lower to upper layers. Experimental investigations on the publicly available MSCOCO dataset show that the proposed approach achieves performance comparable to state-of-the-art approaches. Moreover, the approach is embedded into our custom-designed Android application, CaptionEye, which can generate captions without an internet connection and is controlled through a voice user interface.
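
To make the architectural idea concrete, below is a minimal PyTorch sketch of a GRU stack with residual connections between consecutive layers, as the abstract describes. The class name, layer count, hidden size, and embedding size are illustrative assumptions, not the paper's exact implementation.

    # Minimal sketch of a deep GRU decoder with residual connections
    # between consecutive layers. All hyperparameters are assumed.
    import torch
    import torch.nn as nn

    class ResidualDeepGRU(nn.Module):
        """Stacked GRU layers with residual (skip) connections between
        consecutive layers to improve gradient flow from lower to upper
        layers."""

        def __init__(self, input_size: int, hidden_size: int, num_layers: int = 3):
            super().__init__()
            # The first layer maps the input embedding to the hidden size;
            # subsequent layers are hidden-to-hidden so that the residual
            # addition is dimensionally valid.
            self.layers = nn.ModuleList(
                [nn.GRU(input_size, hidden_size, batch_first=True)]
                + [nn.GRU(hidden_size, hidden_size, batch_first=True)
                   for _ in range(num_layers - 1)]
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out, _ = self.layers[0](x)
            for gru in self.layers[1:]:
                residual = out
                out, _ = gru(out)
                out = out + residual  # residual connection between consecutive layers
            return out

    # Example: a batch of 4 caption prefixes, 20 tokens each, embedded to 256-d.
    decoder = ResidualDeepGRU(input_size=256, hidden_size=512, num_layers=3)
    tokens = torch.randn(4, 20, 256)  # (batch, sequence length, embedding dim)
    print(decoder(tokens).shape)      # torch.Size([4, 20, 512])

Because the skip connection adds each layer's input directly to its output, gradients can bypass the GRU transformations during backpropagation, which is the mechanism the abstract credits for mitigating vanishing gradients in the deeper stack.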
