Intelligent document language classification and web-based translation system

In parallel with advances in communication technologies, knowing the language of content, or detecting the languages of resources published on the internet and translating them into a given language, is of great importance for accessing and sharing information on a global scale, translating it into different languages, and using it in line with individuals' needs. In this study, a web-based software platform is developed in which an intelligent system classifies, and thereby automatically detects, the language of Word, PDF and HTML documents whose content language is unknown, for 15 different languages; translation into 64 languages is then performed automatically using existing solutions. The identification step uses a new language identification method based on artificial neural networks. The developed method was compared with two different approaches and showed high performance for 3 document types in 15 languages, regardless of document size or the nature of the content.

Keywords:

Language identification, language translation, web-based application, artificial neural network

Intelligent document language classification and web-based translation system

Recent developments in information and communication technologies have made it increasingly important to access, share, translate and use documents easily and effectively over the internet. Language identification is an important task for web information retrieval services, and automatic language identification and translation matter more as a growing number of documents are served on the internet in many languages. This study presents new methods to identify the language of web content, including MS Word, PDF and HTML documents in different languages, and to translate that content into specified languages. The identification problem can be seen as a specific instance of the more general problem of classifying an item through its attributes in a limited workspace. The novel approach uses an artificial neural network model to recognize languages. Document content in 15 languages was used in tests with a new testing methodology, and the content was translated automatically into 64 languages. The results show that the presented approaches are very successful in meeting expectations for real-time language identification and translation accuracy, and that they reduce the number of letters in the solution space in comparison with two available methods.

Keywords:

Language identification, language translation, web-based application, artificial neural network
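
The idea of treating language identification as classification of an item through its attributes can be illustrated with a minimal sketch. It is not the method developed in this study: the feature alphabet, the three toy languages, the sample sentences and all function names (`letter_frequencies`, `train`, `predict`) are assumptions made for illustration. Each text is reduced to a vector of relative letter frequencies, and a single-layer softmax network is trained on those vectors with full-batch gradient descent.

```python
import math
from collections import Counter

# Hypothetical feature alphabet: ASCII letters plus a few accented letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyzçğıöşüäß"

def letter_frequencies(text):
    """Map a text to relative letter frequencies over ALPHABET."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[ch] / total for ch in ALPHABET]

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(model, features):
    """Single-layer network: one linear unit per language, then softmax."""
    weights, biases = model
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def train(samples, labels, epochs=300, rate=0.5):
    """Full-batch gradient descent on the mean cross-entropy loss.

    Returns the model and the loss measured before each update.
    """
    inputs = [letter_frequencies(text) for text, _ in samples]
    targets = [labels.index(lang) for _, lang in samples]
    weights = [[0.0] * len(ALPHABET) for _ in labels]
    biases = [0.0] * len(labels)
    history = []
    for _ in range(epochs):
        grad_w = [[0.0] * len(ALPHABET) for _ in labels]
        grad_b = [0.0] * len(labels)
        loss = 0.0
        for x, target in zip(inputs, targets):
            probs = forward((weights, biases), x)
            loss -= math.log(probs[target])
            for c in range(len(labels)):
                error = probs[c] - (1.0 if c == target else 0.0)
                grad_b[c] += error
                for j, value in enumerate(x):
                    grad_w[c][j] += error * value
        history.append(loss / len(inputs))
        for c in range(len(labels)):
            biases[c] -= rate * grad_b[c] / len(inputs)
            for j in range(len(ALPHABET)):
                weights[c][j] -= rate * grad_w[c][j] / len(inputs)
    return (weights, biases), history

def predict(model, labels, text):
    """Return the label with the highest predicted probability."""
    probs = forward(model, letter_frequencies(text))
    return labels[probs.index(max(probs))]

# Toy training data (illustrative only, far too small for real use).
LABELS = ["en", "tr", "de"]
SAMPLES = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("language identification is an important task", "en"),
    ("bu çalışmada yeni bir dil tanıma yöntemi geliştirildi", "tr"),
    ("dokümanların dili otomatik olarak algılanmaktadır", "tr"),
    ("die sprache eines dokuments wird automatisch erkannt", "de"),
    ("künstliche neuronale netze erkennen die sprache", "de"),
]

model, history = train(SAMPLES, LABELS)
print(predict(model, LABELS, "bu doküman hangi dilde yazıldı"))
```

With six sentences the network can only hint at the behaviour of a real system. A practical classifier would need far larger corpora per language, and probably richer attributes than single-letter frequencies.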

