Image captioning in Turkish: database and model

Automatic image captioning is a topic in artificial intelligence that spans both computer vision and natural language processing. The encoder-decoder technique, which was inspired by advances in machine translation and has produced successful results in that field, is currently among the methods used for English image captioning. This study presents a model that automatically generates image captions in Turkish. It tests a Turkish image captioning encoder-decoder model on the Turkish MS COCO database by combining a CNN encoder, which is responsible for extracting the features of the given images, with an RNN decoder, which is responsible for generating the captions. The performance of the generative model was evaluated on the newly created database using both the most common evaluation metrics, such as BLEU, METEOR, ROUGE and CIDEr, and human-based methods. The results show that the performance of the proposed model is satisfactory both qualitatively and quantitatively. A publicly available Web application [1], prepared at the end of the study, gives users an environment for entering Turkish captions for MS COCO images. When all the images have been completed, a Turkish-specific dataset that supports comparative studies will be complete. [1] http://mscococontributor.herokuapp.com/website/

Keywords:

Turkish image captioning, Turkish MS COCO, Computer vision, Natural language processing, CNN, RNN

Image captioning in the Turkish language: database and model

Automatic image captioning is a challenging problem in artificial intelligence that spans both computer vision and natural language processing. Inspired by recent advances in machine translation, the encoder-decoder technique is currently the state of the art in English-language captioning. In this study, we propose an image captioning model for the Turkish language. The paper evaluates the encoder-decoder model on the MS COCO database by coupling an encoder CNN, the component responsible for extracting the features of the given images, with a decoder RNN, the component responsible for generating captions from the given inputs, to generate Turkish captions. We conducted the experiments using the most common evaluation metrics, such as BLEU, METEOR, ROUGE and CIDEr. The results show that the performance of the proposed model is satisfactory in both qualitative and quantitative evaluations. Finally, this study introduces a free-to-use Web platform intended to improve the dataset via crowdsourcing. The Turkish MS COCO database is available for research purposes. When all the images are completed, a Turkish dataset will be available for comparative studies.

Keywords:

Turkish image captioning, Turkish MS COCO, Computer vision, Natural language processing, CNN, RNN
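
Of the evaluation metrics named in the abstract, BLEU is the simplest to illustrate. The following is a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty), written in plain Python. It is an illustration only and not the evaluation code used in the study, which the abstract does not describe. The function names and the Turkish example captions are assumptions made for this sketch, and no smoothing is applied, so any missing n-gram order yields a score of zero.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # geometric mean is zero if any order has no match
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Hypothetical Turkish captions, tokenized on whitespace for illustration.
reference = "bir kedi koltukta oturuyor".split()
candidate = "bir kedi koltukta oturuyor".split()
score = bleu(candidate, [reference])  # identical captions give 1.0
```

Published captioning results are normally reported at corpus level, where n-gram counts are pooled over all images before the precisions are computed, so scores from this per-sentence sketch are not directly comparable to such figures. Whitespace tokenization is also a simplification for an agglutinative language such as Turkish.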

