Comparison of Different Classification Algorithms for Information Extraction from Invoice Images Using an N-Gram Approach

Artificial intelligence (AI) is now used in many fields, and the accounting sector is one of them. Accounting firms can struggle to keep up with the heavy invoicing volume of large companies in particular, which has created a need for AI-supported invoice processing. The goal of this work is to determine the best machine learning model for extracting fields such as invoice number, invoice date, due date, delivery date, total gross, total net, VAT amount, and IBAN from invoice image files. Text obtained with the Tesseract Optical Character Recognition (OCR) system was converted into n-gram format. A set of features was computed for each n-gram: template (layout) information such as coordinates, length, width, and line number, together with the Levenshtein and Jaro-Winkler distances between candidate n-grams and the keywords in a control keyword list. Using the Levenshtein distance between candidate n-grams and the control keywords yielded a sufficiently high prediction rate. The most suitable model and features for training were identified. Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting, K-Nearest Neighbors, AdaBoost, and Decision Tree were compared as prediction models. A total of 9910 invoices collected from various companies were used, split 80% for training and 20% for testing. The Random Forest model using the Levenshtein distance performed best, with an average F1 score of 0.9137.

Keywords:

Machine learning, Information extraction, N-gram, Levenshtein distance, Jaro-Winkler distance
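
As a rough illustration of the text-similarity feature described in the abstract, the following Python sketch builds word n-grams from one line of OCR tokens and computes, for each candidate n-gram, the smallest Levenshtein distance to a control keyword list. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the English keyword list, the n-gram length limit, and the sample tokens are all hypothetical, and the layout features (coordinates, width, line number) and the Jaro-Winkler distance are left out.

```python
from typing import Dict, List, Tuple


def word_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """Return all contiguous word n-grams of length 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def keyword_distance(gram: Tuple[str, ...], keywords: List[str]) -> int:
    """Smallest case-insensitive edit distance from the n-gram to any keyword."""
    text = " ".join(gram).lower()
    return min(levenshtein(text, kw.lower()) for kw in keywords)


# Hypothetical control keywords; the paper's actual list is not given here.
CONTROL_KEYWORDS = ["invoice number", "invoice date", "due date"]

# Simulated OCR output for one line, including a recognition error.
line_tokens = ["Invoce", "Number", "2020-0042"]

features: Dict[str, int] = {
    " ".join(gram): keyword_distance(gram, CONTROL_KEYWORDS)
    for gram in word_ngrams(line_tokens, max_n=2)
}

for text, dist in features.items():
    print(f"{text!r}: {dist}")
```

In this toy example the misread bigram "Invoce Number" is one edit away from the keyword "invoice number", so the distance stays small despite the OCR error. A classifier such as Random Forest could take this value as one column of its feature vector alongside the layout features.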
