Dia ABUZEINA

Exploring bigram character features for Arabic text clustering

The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as employing roots or stems (i.e. words without infixes and prefixes, and/or suffixes) instead of complete word forms, have been proposed to tackle this dimensionality problem. Recently, the literature has shown that a more basic unit can be used as a textual feature: the form of two neighboring characters, which we call a microword and which is also known as the bigram character feature. To evaluate this feature type, we measure the accuracy of Arabic text clustering using two feature types: the complete word form and the microword form. In the experiment, principal component analysis (PCA) is used to reduce the feature vector dimensions, while the k-means algorithm is used for clustering. The testing set includes 250 documents in five categories. The entire corpus contains 54,472 words, whereas the vocabulary contains 13,356 unique words. The experimental results show that the complete word form achieves an accuracy of 97.2%, while the two-character form achieves 96.8%. In conclusion, the accuracies are almost the same; however, the two-character form uses a smaller vocabulary as well as fewer PCA subspaces. These experiments may be a significant indication of the need to consider the bigram character feature in future text processing and natural language processing applications.
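As an illustration of the two feature types compared above, the following minimal sketch builds count vectors from complete word forms and from microwords. It is not the implementation used in the study: the function names are hypothetical, bigrams are assumed to be taken inside each word (never across a space), the toy documents are in Latin script for readability, and the PCA and k-means steps are omitted.

```python
from collections import Counter


def word_features(text):
    """Complete word forms: whitespace-separated tokens."""
    return text.split()


def microword_features(text):
    """Character bigrams taken inside each word.

    Assumption: no bigram spans a word boundary, so a
    one-character word contributes no features.
    """
    bigrams = []
    for word in text.split():
        bigrams.extend(word[i:i + 2] for i in range(len(word) - 1))
    return bigrams


def build_vocabulary(docs, extract):
    """Sorted list of the distinct features found in all documents."""
    return sorted({feature for doc in docs for feature in extract(doc)})


def count_vector(doc, vocabulary, extract):
    """Raw frequency of each vocabulary feature in one document."""
    counts = Counter(extract(doc))
    return [counts[feature] for feature in vocabulary]


# Toy corpus (hypothetical); the study itself uses Arabic documents.
docs = ["text mining", "text clustering"]

word_vocab = build_vocabulary(docs, word_features)
micro_vocab = build_vocabulary(docs, microword_features)

word_matrix = [count_vector(d, word_vocab, word_features) for d in docs]
micro_matrix = [count_vector(d, micro_vocab, microword_features) for d in docs]

print(len(word_vocab), len(micro_vocab))
```

On a corpus this small the microword vocabulary is larger than the word vocabulary. The smaller vocabulary reported in the abstract arises on a realistic corpus, where the number of distinct character pairs is bounded by the alphabet while the number of distinct words keeps growing. Each row of either matrix could then be passed to a dimensionality reduction step and a clustering algorithm.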

