Süleyman EKEN, Alper KANTARCI, Ahmet SAYAR

Automatic Structure Recognition on Digital Documents and Adapting to Different Contents: Case Study on Resumes

Two types of data affect the speed of assimilation and information recall in the mass categorization that is central to most computer operations: structured data and unstructured data. Structured data refers to information with a high degree of organization, such as data held in a relational database. Unstructured data may have its own internal structure, but it does not map neatly onto a spreadsheet or database. CVs (curricula vitae) are data of this kind. CVs are typically presented in PDF (Portable Document Format) and can be made structured with the PDF tagging feature; however, most PDF data is untagged and unstructured. It is very difficult for non-technical business users and data analysts to deal with such closed boxes. In this study, a web-based smart resume designer was developed so that people do not lose time preparing resumes and can create resumes of their own, from their own information, in different accepted formats. The content structure of PDF documents, the text data, and the font and location information of that text were extracted; the extracted information was converted into defined structures in reading order, and a resume template based on a predefined XML (Extensible Markup Language) format was created. Personal PDF documents are then created from these templates. PDF analysis and PDF creation were performed by directly accessing the content stream of the PDF document with the Java iText-pdf library. The templates are presented to the end user in a desktop application with a GUI, and users can transfer any design to their own document with a select-and-apply approach. The template obtained from a PDF document is intended to be saved in XML format, and the saved ready-made XML templates are intended to be used when adapting to different contents. An XSD (XML Schema Definition) was defined so that templates in XML format can be created automatically and their validity tested afterwards. With the developed application, the formats of resumes were recognized automatically and adapted to different contents.
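The pipeline described above (text chunks with font and position, ordered for reading, written to an XML template, then checked against an XSD) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `TextChunk` class, the `template`/`chunk` element names, the schema, and the sample values are all hypothetical, and the chunks are hard-coded where the study would obtain them from the PDF content stream via iText. The sketch uses only the JDK's built-in XML APIs and assumes the default PDF coordinate system, in which y grows upward from the bottom of the page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

class ResumeTemplateSketch {

    /** Hypothetical chunk: text plus font and position, as a PDF library might report it. */
    static final class TextChunk {
        final String text; final String font; final float size; final float x; final float y;
        TextChunk(String text, String font, float size, float x, float y) {
            this.text = text; this.font = font; this.size = size; this.x = x; this.y = y;
        }
    }

    /** Reading order (assumed): top of page first (larger y), then left to right. */
    static final Comparator<TextChunk> READING_ORDER =
        Comparator.comparingDouble((TextChunk c) -> -c.y).thenComparingDouble(c -> c.x);

    /** Hypothetical schema: a template is a sequence of chunks with font and position attributes. */
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='template'><xs:complexType><xs:sequence>"
      + "<xs:element name='chunk' maxOccurs='unbounded'><xs:complexType><xs:simpleContent>"
      + "<xs:extension base='xs:string'>"
      + "<xs:attribute name='font' type='xs:string' use='required'/>"
      + "<xs:attribute name='size' type='xs:float' use='required'/>"
      + "<xs:attribute name='x' type='xs:float' use='required'/>"
      + "<xs:attribute name='y' type='xs:float' use='required'/>"
      + "</xs:extension></xs:simpleContent></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Sorts the chunks into reading order and serializes them as an XML template string. */
    static String toXml(List<TextChunk> chunks) {
        List<TextChunk> ordered = new ArrayList<>(chunks);
        ordered.sort(READING_ORDER);
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("template");
            doc.appendChild(root);
            for (TextChunk c : ordered) {
                Element e = doc.createElement("chunk");
                e.setAttribute("font", c.font);
                e.setAttribute("size", Float.toString(c.size));
                e.setAttribute("x", Float.toString(c.x));
                e.setAttribute("y", Float.toString(c.y));
                e.setTextContent(c.text);
                root.appendChild(e);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build template XML", e);
        }
    }

    /** Returns true if the XML string conforms to the hypothetical template schema. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            factory.newSchema(new StreamSource(new StringReader(XSD)))
                   .newValidator()
                   .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = new ArrayList<>();
        chunks.add(new TextChunk("Education", "Helvetica-Bold", 12f, 50f, 600f));
        chunks.add(new TextChunk("Jane Doe", "Helvetica-Bold", 18f, 50f, 750f));
        String xml = toXml(chunks);
        System.out.println(xml);
        System.out.println("valid: " + isValid(xml));
    }
}
```

Storing font and position as attributes keeps the layout information separate from the text, so in this sketch a different person's content could be substituted into the same `chunk` slots without touching the design data.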

