Clustering with density based initialization and Bhattacharyya based merging

Centroid-based clustering approaches, such as k-means, are relatively fast but inaccurate for arbitrarily shaped clusters. Fuzzy c-means with the Mahalanobis distance can accurately identify clusters if the data set can be modelled by a mixture of Gaussian distributions; however, it requires the number of clusters a priori, and a bad initialization can cause poor results. Density-based clustering methods, such as DBSCAN, overcome these disadvantages but may perform poorly when the data set is imbalanced. This paper proposes a clustering method, named clustering with density-based initialization and Bhattacharyya-based merging, built on fuzzy clustering. Initialization is carried out by density estimation with an adaptive bandwidth using the k-Nearest Orthant-Neighbor algorithm to avoid the effects of imbalanced clusters. The local peaks of the point clouds constructed by the k-Nearest Orthant-Neighbor algorithm are used as initial cluster centers for fuzzy clustering. We use the Bhattacharyya measure and Jensen's inequality to find overlapping Gaussians and merge them into a single cluster. Experiments on a variety of data sets show that the proposed algorithm has remarkable advantages, especially for imbalanced and arbitrarily shaped data sets.
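The merging step described above rests on the overlap between pairs of Gaussian components. As a minimal sketch of that idea, the snippet below computes the standard closed-form Bhattacharyya distance between two multivariate Gaussians; the helper name, the example means/covariances, and the use of the exponentiated distance as an overlap score are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussians
    N(mu1, cov1) and N(mu2, cov2)."""
    cov = (cov1 + cov2) / 2.0          # average covariance
    diff = mu1 - mu2
    # Mahalanobis-like term under the averaged covariance
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Term penalizing covariance mismatch
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

# Two unit-covariance Gaussians separated along the first axis
mu_a, cov_a = np.array([0.0, 0.0]), np.eye(2)
mu_b, cov_b = np.array([5.0, 0.0]), np.eye(2)
db = bhattacharyya_distance(mu_a, cov_a, mu_b, cov_b)
overlap = np.exp(-db)  # Bhattacharyya coefficient in (0, 1]
# Two components could be merged when `overlap` exceeds a chosen
# threshold (the threshold here is hypothetical).
```

With identical covariances the second term vanishes, so well-separated components yield a large distance and an overlap score near zero, while identical components yield distance zero and overlap one.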

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK