Comparison of the Methods to Determine the Optimal Number of Clusters

Clustering is an unsupervised learning technique that divides observations into groups based on their similarity. The most widely used clustering algorithm is k-means; however, this algorithm requires the number of clusters to be specified in advance. In this study, four of the most widely used methods for determining the number of clusters were applied: the Average Silhouette, Caliński-Harabasz, Davies-Bouldin, and Dunn indices. The performances of these methods were compared using the Rand Index and Meilă's Variation of Information (MVI) on nine real data sets for which the number of clusters was known in advance. According to these criteria, the Average Silhouette gave the most successful results.
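The two kinds of measure used above can be sketched in plain Python: the Average Silhouette as an internal index for choosing k (higher is better), and the Rand Index as an external criterion measuring agreement with known labels. This is a minimal illustration on hypothetical toy 2-D points, not the study's actual pipeline; the point coordinates and partitions below are invented for demonstration.

```python
from itertools import combinations
import math

def avg_silhouette(points, labels):
    """Mean silhouette s(i) = (b - a) / max(a, b) over all points, where
    a = mean intra-cluster distance of point i and b = smallest mean
    distance from i to the points of another cluster."""
    ks = set(labels)
    total = 0.0
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:
            continue  # s(i) = 0 for singleton clusters by convention
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == k)
            / sum(1 for l in labels if l == k)
            for k in ks if k != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

def rand_index(a, b):
    """Fraction of point pairs on which two clusterings agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Two well-separated 2-D blobs: the true k = 2 partition should score
# a higher average silhouette than a k = 3 partition that splits a blob.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
k2 = [0, 0, 0, 1, 1, 1]   # the true partition
k3 = [0, 0, 1, 2, 2, 2]   # one blob split in two
print(avg_silhouette(pts, k2) > avg_silhouette(pts, k3))   # True
print(rand_index(k2, [1, 1, 1, 0, 0, 0]))                  # 1.0: same grouping, labels permuted
```

The Rand Index is invariant to a relabeling of the clusters, which is why the permuted labeling still scores 1.0; this is exactly the property needed when comparing a recovered partition against ground-truth class labels.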
