Exploring bigram character features for Arabic text clustering

The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as using roots or stems (i.e., words stripped of prefixes, infixes, and/or suffixes) instead of complete word forms, have been proposed to tackle this dimensionality problem. Recent literature shows that an even more basic unit can be used to represent textual features: the form of two neighboring characters, which we call a microword and which is also known as the character bigram feature. To evaluate this feature type, we measure the accuracy of Arabic text clustering using two feature types: the complete word form and the microword form. In the experiment, principal component analysis (PCA) is used to reduce the dimensionality of the feature vectors, while the k-means algorithm is used for clustering. The testing set includes 250 documents in five categories. The entire corpus contains 54,472 words, whereas the vocabulary contains 13,356 unique words. The experimental results show that the complete word form achieves an accuracy of 97.2%, while the two-character form achieves 96.8%. In conclusion, the accuracies are almost the same; however, the two-character form uses a smaller vocabulary as well as fewer PCA subspaces. These experiments may be a significant indication that the character bigram feature should be considered in future text processing and natural language processing applications.
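The two feature types compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the choice to keep bigrams inside word boundaries, and the two sample sentences are all assumptions made for illustration. The PCA and k-means steps are omitted here; the sketch only shows how word and microword count vectors might be built.

```python
from collections import Counter


def char_bigrams(word):
    """Return the overlapping two-character sequences (microwords) of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def word_features(text):
    """Count complete word forms (whitespace tokenization is an assumption)."""
    return Counter(text.split())


def bigram_features(text):
    """Count character bigrams per word.

    Assumption: bigrams do not span word boundaries, so single-character
    words contribute no features.
    """
    counts = Counter()
    for word in text.split():
        counts.update(char_bigrams(word))
    return counts


def to_vector(counts, vocabulary):
    """Map a feature counter onto a fixed, ordered vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]


# Hypothetical two-document corpus, for illustration only.
docs = ["كتب الكاتب كتابا", "كتب الطالب الكتاب"]

word_counts = [word_features(d) for d in docs]
bigram_counts = [bigram_features(d) for d in docs]

# The vocabulary is the sorted union of all features seen in the corpus.
word_vocab = sorted(set().union(*word_counts))
bigram_vocab = sorted(set().union(*bigram_counts))

# One count vector per document, ready for dimensionality reduction and clustering.
word_matrix = [to_vector(c, word_vocab) for c in word_counts]
bigram_matrix = [to_vector(c, bigram_vocab) for c in bigram_counts]

print("word vocabulary size:", len(word_vocab))
print("bigram vocabulary size:", len(bigram_vocab))
```

On a corpus this small the bigram vocabulary can be larger than the word vocabulary. The smaller vocabulary reported in the abstract would be expected only at corpus scale, where the number of distinct two-character combinations is bounded by the alphabet while the number of distinct words keeps growing.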

___

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK