SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning

There is growing interest in studies on text classification as a result of the exponential increase in the amount of data available. Many studies have been conducted in the field of text clustering, using different approaches. This study introduces Spectral Sentence Clustering (SSC) for text clustering problems, which is an unsupervised method based on graph-partitioning. The study explains how the proposed model proposed can be used in natural language applications to successfully cluster texts. A spectral graph theory method is used to partition the graph into non-intersecting sub-graphs, and an unsupervised and efficient solution is offered for the text clustering problem by providing a physical representation of the texts. Finally, tests have been conducted demonstrating that SSC can be successfully used for text categorization. A clustering success rate of 97.08% was achieved in tests conducted using the TTC-3600 dataset, which contains open-access unstructured Turkish texts, classified into categories. The SSC model proposed performed better compared to a popular k-means clustering algorithm.

SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning

There is growing interest in studies on text classification as a result of the exponential increase in the amount of data available. Many studies have been conducted in the field of text clustering, using different approaches. This study introduces Spectral Sentence Clustering (SSC) for text clustering problems, which is an unsupervised method based on graph-partitioning. The study explains how the proposed model proposed can be used in natural language applications to successfully cluster texts. A spectral graph theory method is used to partition the graph into non-intersecting sub-graphs, and an unsupervised and efficient solution is offered for the text clustering problem by providing a physical representation of the texts. Finally, tests have been conducted demonstrating that SSC can be successfully used for text categorization. A clustering success rate of 97.08% was achieved in tests conducted using the TTC-3600 dataset, which contains open-access unstructured Turkish texts, classified into categories. The SSC model proposed performed better compared to a popular k-means clustering algorithm.

___

  • [1] Ş. Canberk, G. , Sağıroğlu, Bilgi ve Bilgisayar Güvenliği : Casus Yazılımlar ve Korunma Yöntemleri. Ankara: Grafiker Yayıncılık, 2006.
  • [2] Osman Durmaz;, “Metin Sınıflandırmada Boyut Azaltmanın Etkisi ve Özellik Seçimi,” Gazi Üniversitesi, 2011.
  • [3] C. Hark, A. Seyyarer, T. Uçkan, and A. Karci, “Doǧal dil işleme yaklaşimlari ile yapisal olmayan dökümanlarin benzerliǧi,” in IDAP 2017 - International Artificial Intelligence and Data Processing Symposium, 2017.
  • [4] D. D. Lewis, “Naive (Bayes) at forty: The independence assumption in information retrieval,” in European conference on machine learning, 1998, pp. 4–15.
  • [5] K. Aas and L. Eikvil, “Text categorisation: A survey.” Technical report, Norwegian computing center, 1999.
  • [6] N. O. Andrews and E. A. Fox, “Recent Developments in Document Clustering,” 2007.
  • [7] N. Shah, “Document Clustering : A Detailed Review,” vol. 4, no. 5, pp. 30–38, 2012.
  • [8] A. Pujari, “Data mining techniques,” 2001.
  • [9] S. N. Bharath Bhushan and A. Danti, “Classification of text documents based on score level fusion approach,” Pattern Recognit. Lett., vol. 94, pp. 118–126, Jul. 2017.
  • [10] R. J. Wilson and J. J. Watkins, Graphs: an introductory approach: a first course in discrete mathematics. John Wiley & Sons Inc, 1990.
  • [11] H. Qiu and E. R. Hancock, “Graph matching and clustering using spectral partitions,” Pattern Recognit., vol. 39, pp. 22–34, 2006.
  • [12] D. Kılınç, A. Özçift, F. Bozyigit, P. Yıldırım, F. Yücalar, and E. Borandag, “TTC-3600: A new benchmark dataset for Turkish text categorization,” J. Inf. Sci., vol. 43, no. 2, pp. 174–185, Apr. 2017.
  • [13] C. Hark, A. Seyyarer, T. Uçkan, and A. Karci, “Doǧal dil işleme yaklaşimlari ile yapisal olmayan dökümanlarin benzerliǧi,” in IDAP 2017 - International Artificial Intelligence and Data Processing Symposium, 2017, pp. 1–6.
  • [14] C. Hark, T. Uçkan, and A. Seyyarer, Abubekir Karci, “Metin Özetleme İçin Çizge Tabanlı Bir Öneri,” in IDAP 2018 - International Artificial Intelligence and Data Processing Symposium, 2018.
  • [15] A. Karci, “Çizge Algoritmaları ve Çizge Bölmeleme.pdf,” 1998.
  • [16] U. Von Luxburg, “A Tutorial on Spectral Clustering,” 2007.
  • [17] B. Slininger, “Fiedler’s Theory of Spectral Graph Partitioning.”
  • [18] H. Jia, S. Ding, and M. Du, “Self-Tuning p-Spectral Clustering Based on Shared Nearest Neighbors,” Cognit. Comput.
  • [19] L. Macdonald, “Successive Approximations by the Rayleigh-Ritz Variation Methoti,” 1933.
  • [20] L. Hagen, … A. K. on computer-aided design of, and U. 1992, “New spectral methods for ratio cut partitioning and clustering,” ieeexplore.ieee.org.
  • [21] S. Jianbo and M. Jitendra, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
  • [22] A. Ben Ayed, M. Ben Halima, and A. M. Alimi, “Adaptive fuzzy exponent cluster ensemble system based feature selection and spectral clustering,” IEEE Int. Conf. Fuzzy Syst., no. July, 2017.
  • [23] M. Filippone, F. Camastra, F. Masulli, and S. Rovetta, “A survey of kernel and spectral methods for clustering,” Pattern Recognit., vol. 41, pp. 176–190, 2008.
  • [24] W. R. Casper and B. Nadiga, “A New Spectral Clustering Algorithm.”
  • [25] R. M. Alguliev, R. M. Aliguliyev, and M. S. Hajirahimova, “GenDocSum+ MCLR: Generic document summarization based on maximum coverage and less redundancy,” Expert Syst. Appl., vol. 39, no. 16, pp. 12460–12473, 2012.
  • [26] M. Fiedler, “Algebraic connectivity of graphs,” Czechoslov. Math. J., vol. 23, no. 2, pp. 298–305, 1973.
  • [27] F. R. K. Chung, “Lectures on Spectral Graph Theory.”
  • [28] C. T. Zheng, C. Liu, and H. S. Wong, “Corpus-based topic diffusion for short text clustering,” Neurocomputing, vol. 275, pp. 2444–2458, 2018.
  • [29] J. Xu et al., “Self-Taught convolutional neural networks for short text clustering,” Neural Networks, vol. 88, pp. 22–31, 2017.
  • [30] F. Beil, M. Ester, B. Bc, and C. Va, “Frequent Term-Based Text Clustering,” 2002.
  • [31] C. Manning, P. Raghavan, and H. Schütze, “Introduction to information retrieval,” Nat. Lang. Eng., vol. 16, no. 1, pp. 100–103, 2010.
  • [32] M. Kozlowski and H. Rybinski, “Clustering of semantically enriched short texts,” J. Intell. Inf. Syst., Dec. 2018.
Politeknik Dergisi-Cover
  • ISSN: 1302-0900
  • Yayın Aralığı: Yılda 4 Sayı
  • Başlangıç: 1998
  • Yayıncı: GAZİ ÜNİVERSİTESİ