Efficient Text Classification with Deep Learning on Imbalanced Data Improved with Better Distribution

Technological developments and the widespread use of the internet cause the volume of data produced daily to grow exponentially. A significant portion of this data deluge is text from sources such as social media, communication tools, and customer service applications. Processing this large amount of text data requires automation. Significant progress has been made in text processing recently; in particular, deep learning has pushed text classification performance to quite satisfactory levels. In this study, we propose an innovative data distribution algorithm that mitigates the data imbalance problem to further improve text classification performance. Experimental results show an improvement of approximately 3.5% in classification accuracy and over 3 points in F1 score with the algorithm that optimizes the data distribution.
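The abstract does not detail the proposed distribution algorithm itself. For orientation only, a common baseline for reducing class imbalance in text classification is random oversampling of minority classes; the sketch below (a hypothetical helper, not the authors' method) duplicates minority-class samples until every class matches the largest one:

```python
import random
from collections import defaultdict

def oversample_to_balance(texts, labels, seed=42):
    """Randomly duplicate minority-class samples until every class
    reaches the size of the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    target = max(len(samples) for samples in by_class.values())
    balanced_texts, balanced_labels = [], []
    for label, samples in by_class.items():
        # Draw extra samples (with replacement) to close the gap.
        extras = [rng.choice(samples) for _ in range(target - len(samples))]
        for text in samples + extras:
            balanced_texts.append(text)
            balanced_labels.append(label)
    return balanced_texts, balanced_labels

# Toy imbalanced dataset: 3 positive samples vs. 1 negative.
texts = ["good", "great", "fine", "bad"]
labels = ["pos", "pos", "pos", "neg"]
bt, bl = oversample_to_balance(texts, labels)
```

Naive oversampling can encourage overfitting on duplicated samples, which is one reason more careful distribution-optimizing schemes, such as the one this study proposes, are of interest.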
