Emre KARAGÖZ, Kutan KORUYAN

Bulut Tabanlı Bilgisayarlı Görü Kullanılarak Sesli Betimleme Sistem Tasarımı

Design of Audio Description System Using Cloud Based Computer Vision

Advances in multimedia tools are now used in many areas of life and add considerable value. Artificial intelligence has matured to the point where hundreds of applications and methods exist to support the living standards of people with disabilities. The system developed in this study automatically describes the scenes of video media such as films and documentaries watched by visually impaired people, using a computer vision technique, and conveys the results to users as speech. HTML5 and CSS are used for the system's interface, and PHP and JavaScript for its programming. MySQL is used as the system's database. Computer vision, text-to-speech conversion, and machine translation are the main instruments used in this study. The cloud-based Microsoft Azure Computer Vision API is used for computer vision, the JavaScript Responce.js library for text-to-speech conversion, and the Google Cloud Translation and Microsoft Azure Translator Text APIs for translation from one language to another.

Keywords:

Audio Description, Computer Vision, Text to Speech Translation, Machine Translation, Cloud Computing
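
The three-stage flow named in the abstract (scene captioning, machine translation, text-to-speech) can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract says was written in PHP and JavaScript. The class `AudioDescriber`, the stage signatures, and the `fake_*` stand-ins are all hypothetical, and no real cloud API is called; an actual system would wrap the vendor clients behind the same three callables.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions, not vendor APIs).
Captioner = Callable[[bytes], str]         # image bytes -> caption text
Translator = Callable[[str, str], str]     # (text, target language) -> translated text
Synthesizer = Callable[[str, str], bytes]  # (text, language) -> audio bytes


@dataclass
class AudioDescriber:
    caption: Captioner
    translate: Translator
    synthesize: Synthesizer
    source_lang: str = "en"  # assumed language of the captioning output

    def describe(self, frame: bytes, target_lang: str) -> tuple[str, bytes]:
        """Caption one frame, translate if needed, and return (text, audio)."""
        text = self.caption(frame)
        if target_lang != self.source_lang:
            text = self.translate(text, target_lang)
        return text, self.synthesize(text, target_lang)


# Stand-in stages so the sketch runs without any network access.
def fake_caption(frame: bytes) -> str:
    return "a man riding a horse"


def fake_translate(text: str, target_lang: str) -> str:
    table = {("a man riding a horse", "tr"): "ata binen bir adam"}
    return table.get((text, target_lang), text)


def fake_synthesize(text: str, lang: str) -> bytes:
    # A real synthesizer would return encoded audio; here we return tagged text.
    return f"[{lang}] {text}".encode("utf-8")


describer = AudioDescriber(fake_caption, fake_translate, fake_synthesize)
text_tr, audio_tr = describer.describe(b"\x00", "tr")
text_en, audio_en = describer.describe(b"\x00", "en")
```

Passing the stages in as callables keeps the pipeline independent of any one provider, so a captioning or translation service could be swapped without touching `describe`.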
