lFIT: an unsupervised discretization method based on the Ramer–Douglas–Peucker algorithm

Discretization is the process of converting continuous values into discrete values. It is a preprocessing step of several machine learning and data mining algorithms, and the quality of discretization may drastically affect the performance of these algorithms. In this study we propose a discretization algorithm, namely line fitting-based discretization (lFIT), based on the Ramer–Douglas–Peucker algorithm. It is a static, univariate, unsupervised, splitting-based, global, and incremental discretization method in which intervals are determined by the Ramer–Douglas–Peucker algorithm and the quality of partitioning is assessed by the standard error of the estimate. To evaluate the performance of the proposed method, a set of experiments is conducted on ten benchmark datasets and the achieved results are compared to those obtained by eight state-of-the-art discretization methods. Experimental results show that lFIT achieves higher predictive accuracy and produces fewer inconsistencies, although it generates a larger number of intervals. The obtained results are also validated through Friedman's test and Holm's post hoc test, which indicate that lFIT produces discretization schemes statistically competitive with both supervised and unsupervised discretization methods.
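As a rough illustration of the idea, the sketch below applies the Ramer–Douglas–Peucker algorithm to a sorted attribute, viewed as the plane curve (rank, value), and keeps the retained breakpoints as candidate cut points. This is a minimal sketch, not the authors' implementation: the function names, the (rank, value) curve representation, and the fixed `eps` tolerance are assumptions made here for illustration; in lFIT proper the quality of each partition is assessed with the standard error of the estimate rather than a fixed tolerance.

```python
import numpy as np

def farthest_point(points, eps):
    """Index of the point with the largest perpendicular distance to the
    chord joining the first and last points, or None if no point
    deviates by more than eps."""
    start, end = points[0], points[-1]
    dx, dy = end - start
    norm = np.hypot(dx, dy)
    if norm == 0.0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # Perpendicular distance of every point to the start-end line.
        dists = np.abs(dx * (points[:, 1] - start[1])
                       - dy * (points[:, 0] - start[0])) / norm
    k = int(np.argmax(dists))
    return k if dists[k] > eps else None

def rdp_cut_points(values, eps):
    """Sort the attribute, view it as the plane curve (rank, value), and
    return the RDP breakpoints as candidate interval cut points."""
    v = np.sort(np.asarray(values, dtype=float))
    pts = np.column_stack([np.arange(v.size), v])
    cuts = []

    def recurse(lo, hi):          # inclusive index range into pts
        if hi - lo < 2:           # no interior point left to split on
            return
        k = farthest_point(pts[lo:hi + 1], eps)
        if k is not None:
            recurse(lo, lo + k)
            cuts.append(v[lo + k])   # value at the breakpoint = cut point
            recurse(lo + k, hi)

    recurse(0, v.size - 1)
    return sorted(cuts)

# Toy usage: a bimodal attribute; RDP splits at the knee of the sorted
# curve, typically yielding a cut point near the gap between the modes.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
print(rdp_cut_points(x, eps=2.0))
```

In this reading, the splitting-based and incremental character of the method corresponds to the recursive refinement: each interval is split only while its points deviate too much from a straight-line fit, which is where a standard-error-of-the-estimate criterion would naturally replace the fixed `eps` used above.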
