Clustering with density based initialization and Bhattacharyya based merging
Centroid-based clustering approaches, such as k-means, are relatively fast but inaccurate for arbitrarily shaped clusters. Fuzzy c-means with the Mahalanobis distance can accurately identify clusters when the data set can be modelled by a mixture of Gaussian distributions. However, it requires the number of clusters a priori, and a bad initialization can produce poor results. Density-based clustering methods, such as DBSCAN, overcome these disadvantages, but they may perform poorly when the data set is imbalanced. This paper proposes a clustering method, named clustering with density-based initialization and Bhattacharyya-based merging, built on fuzzy clustering. The initialization is carried out by density estimation with adaptive bandwidth using the k-Nearest Orthant-Neighbor algorithm, which mitigates the effects of imbalanced clusters. The local peaks of the point clouds constructed by the k-Nearest Orthant-Neighbor algorithm serve as initial cluster centers for the fuzzy clustering. We then use the Bhattacharyya measure and Jensen's inequality to detect overlapping Gaussian components and merge them into a single cluster. Experiments on a variety of data sets show that the proposed algorithm has remarkable advantages, especially for imbalanced and arbitrarily shaped data sets.
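The merging step rests on the closed-form Bhattacharyya distance between two Gaussian densities. A minimal sketch of that computation is below; the merge threshold (`0.5`) and the pairwise-merging rule are illustrative assumptions, not the paper's exact Jensen-inequality criterion.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussian densities
    N(mu1, cov1) and N(mu2, cov2)."""
    cov = 0.5 * (cov1 + cov2)          # averaged covariance
    diff = mu1 - mu2
    # Mahalanobis-like term for the mean separation
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Term penalizing covariance mismatch
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def bhattacharyya_coefficient(mu1, cov1, mu2, cov2):
    """Overlap measure in [0, 1]; 1 means the two Gaussians coincide."""
    return np.exp(-bhattacharyya_distance(mu1, cov1, mu2, cov2))

def should_merge(mu1, cov1, mu2, cov2, threshold=0.5):
    """Merge two components when their overlap exceeds a (hypothetical)
    threshold; the paper's actual criterion uses Jensen's inequality."""
    return bhattacharyya_coefficient(mu1, cov1, mu2, cov2) > threshold
```

For example, two unit-covariance Gaussians with coincident means give a coefficient of exactly 1 and are merged, while components separated by many standard deviations give a coefficient near 0 and remain distinct.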