ON THE EFFECTIVENESS OF PARAGRAPH VECTOR MODELS IN DOCUMENT SIMILARITY ESTIMATION FOR TURKISH NEWS CATEGORIZATION

News categorization, a common application of text classification, is the task of automatically annotating news articles with predefined categories. With the rise of deep learning techniques in machine learning, neural embedding models have been widely used to capture hidden relationships and similarities among textual representations of news articles. In this study, we approach the Turkish news categorization problem as an ad-hoc retrieval task and investigate how effectively paragraph vector models can compute and exploit document-wise similarities among Turkish news articles. We propose an ensemble categorization approach that consists of three main stages: document processing, paragraph vector learning, and document similarity estimation. Extensive experiments on the TTC-3600 dataset show that the proposed system reaches up to 93.5% classification accuracy, a strong result relative to the baseline and state-of-the-art methods. Moreover, the Distributed Bag of Words version of Paragraph Vectors outperforms the Distributed Memory Model of Paragraph Vectors in terms of both accuracy and computational performance.
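As a rough illustration of the final stage, document similarity estimation used as retrieval, the following sketch assigns a category to a query article by ranking labeled articles by cosine similarity and taking a majority vote over the top k. It is a minimal sketch under stated assumptions, not the system evaluated in the study: the vectors are toy values standing in for learned paragraph vectors, and the names `cosine` and `categorize`, the value of `k`, and the majority-vote rule are hypothetical choices that the abstract does not specify.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(query_vec, labeled_vecs, k=3):
    """Return the most frequent category among the k labeled vectors most similar to the query.

    labeled_vecs is a list of (vector, category) pairs. Majority voting over
    the top-k retrieved documents is an assumption made for illustration.
    """
    ranked = sorted(
        labeled_vecs,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy 3-dimensional vectors standing in for learned paragraph vectors.
train = [
    ([0.9, 0.1, 0.0], "sports"),
    ([0.8, 0.2, 0.1], "sports"),
    ([0.1, 0.9, 0.2], "economy"),
    ([0.0, 0.8, 0.3], "economy"),
    ([0.1, 0.2, 0.9], "health"),
]

predicted = categorize([0.85, 0.15, 0.05], train, k=3)
print(predicted)
```

In practice the toy vectors would be replaced by embeddings inferred from a trained paragraph vector model, and `k` would be tuned on held-out data.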

