Ahmet SAYAR, Süleyman EKEN, Burak ATAY, Büşra Ceren SÖNMEZ

DocDig: Content-Based Figure Search in Digitized Documents

Pattern recognition is used in many fields, from psychology to biometrics, from bioinformatics to gene expression analysis, and from traffic to computational finance. Optical Character Recognition (OCR) is one of these fields. Many public and private organizations scan the filed documents in their archives to digitize them, which requires labor-intensive work. However, because these documents are digitized as images, searching and processing their content is only partially possible, and only when operators manually add metadata to the scanned images. In this work, we developed an architecture that enables content-based figure search over large quantities of such scanned documents. A user can search with keywords and view the relevant figures in the digitized documents together with their captions. The feasibility and performance of the system were tested on different data sets, and successful results were obtained.

Keywords:

Document digitization, Figure/picture detection, Caption detection, Content-based search, MongoDB
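
The keyword-to-figure lookup described in the abstract can be illustrated with a minimal sketch. It assumes that figure detection, caption detection, and OCR have already produced one caption string per figure. The `Figure` record, the `CaptionIndex` class, and the sample captions are hypothetical names introduced for illustration only, and the in-memory inverted index stands in for whatever storage the actual system uses (the keywords mention MongoDB).

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    doc_id: str   # hypothetical identifier of the scanned document
    page: int     # page on which the figure was detected
    caption: str  # caption text, e.g. as produced by an OCR step


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


class CaptionIndex:
    """Inverted index from caption words to the figures containing them."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> positions in _figures
        self._figures = []

    def add(self, figure):
        position = len(self._figures)
        self._figures.append(figure)
        for token in tokenize(figure.caption):
            self._postings[token].add(position)

    def search(self, query):
        """Return figures whose captions contain every query keyword."""
        tokens = tokenize(query)
        if not tokens:
            return []
        # .get avoids creating empty postings for unknown keywords.
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return [self._figures[i] for i in sorted(hits)]


# Usage with made-up captions (not taken from the paper's data sets).
index = CaptionIndex()
index.add(Figure("doc-001", 3, "Figure 2. System architecture of the search service"))
index.add(Figure("doc-001", 7, "Figure 5. Query latency by data set size"))
index.add(Figure("doc-002", 1, "Figure 1. Overall architecture"))

for fig in index.search("architecture"):
    print(fig.doc_id, fig.page, fig.caption)
```

Requiring every keyword to match is one possible design choice. Ranking by the number of matching keywords would be an equally plausible alternative that the abstract does not rule out.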
