Multi-Class Document Classification Based on Deep Neural Network and Word2Vec

With the growth of unstructured data, classifying text-based documents has become increasingly important. In particular, classifying news texts and digital documentation makes it easier to reach the information being sought. This study uses a large collection of textual news data. After the data set was preprocessed, the Bag of Words (BoW), TF-IDF, Word2Vec and Doc2Vec word representation methods were applied. In the classification phase, the Random Forest (RF), Multilayer Perceptron (MLP), Support Vector Machine (SVM) and Deep Neural Network (DNN) algorithms were applied. In the experiments, the combination of the Word2Vec method with the DNN algorithm gave the best result.
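
As a rough illustration of the two simplest representations named in the abstract, the sketch below builds Bag of Words count vectors and TF-IDF weights in plain Python. It is not the authors' implementation: the three example sentences, the whitespace tokenizer, and the function names `tokenize`, `bag_of_words` and `tf_idf` are assumptions made for illustration, and the weighting uses the textbook form tf × log(N / df), whereas libraries such as scikit-learn apply smoothed variants by default.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(count_vectors):
    """Weight each count as (relative term frequency) * log(N / document frequency)."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    weighted = []
    for vec in count_vectors:
        total = sum(vec) or 1  # guard against an empty document
        weighted.append(
            [(vec[j] / total) * math.log(n_docs / df[j]) for j in range(n_terms)]
        )
    return weighted


# Hypothetical toy corpus; the study itself uses a large news data set.
docs = [
    "stocks rally as markets rise",
    "team wins the final match",
    "markets fall as stocks slide",
]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

In this toy corpus, a word that occurs in a single document (such as "rally") receives a higher TF-IDF weight than one shared by two documents (such as "stocks"). Vectors of this kind, or dense Word2Vec and Doc2Vec vectors, would then be passed to a classifier such as RF, MLP, SVM or a DNN.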
