DTreeSim: A new approach to compute decision tree similarity using re-mining

A number of recent studies have used decision trees as a data mining technique, and some of them needed to evaluate the similarity of decision trees in order to compare the knowledge reflected in different trees or datasets. Several perspectives and calculation techniques have been proposed for measuring the similarity of two decision trees, such as a simple formula or an entropy measure. The main objective of this study is to compute the similarity of decision trees using data mining techniques. This study proposes DTreeSim, a new approach that applies three data mining techniques in sequence (classification, sequential pattern mining, and k-nearest neighbors) to identify similarities among decision trees. First, decision trees are constructed from different data marts using a classification algorithm. Next, sequential pattern mining is applied to the decision trees to obtain rules. Finally, the k-nearest neighbor algorithm is run on these rules to compute similarities using two novel measures: general similarity and pieced similarity. Our experimental studies compare the results of these two measures with each other and compare our approach with existing approaches. The comparisons indicate that the proposed approach performs better than existing approaches because it takes into account the values of the branches in the trees through sequential pattern mining.
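The three-stage pipeline described above can be illustrated with a small sketch. It is not the DTreeSim implementation, and it rests on several assumptions: each tree is already given as a list of root-to-leaf paths (hypothetical toy data), a brute-force subsequence enumeration stands in for a real sequential pattern mining algorithm, and plain Jaccard overlap stands in for the general and pieced similarity measures, which the abstract names but does not define. All function and variable names are illustrative.

```python
from itertools import combinations

# Hypothetical toy input: each tree is a list of root-to-leaf paths, and each
# path is an ordered sequence of branch conditions ending in a class label.
tree_a = [
    ("age<=30", "income=high", "class=yes"),
    ("age<=30", "income=low", "class=no"),
    ("age>30", "class=yes"),
]
tree_b = [
    ("age<=30", "income=high", "class=yes"),
    ("age<=30", "income=low", "class=yes"),
    ("age>30", "class=no"),
]
tree_c = [
    ("color=red", "class=no"),
    ("color=blue", "class=yes"),
]


def subsequences(path, max_len=3):
    """Return every order-preserving subsequence of a path, up to max_len items."""
    found = set()
    for size in range(1, min(max_len, len(path)) + 1):
        for idx in combinations(range(len(path)), size):
            found.add(tuple(path[i] for i in idx))
    return found


def mine_patterns(paths, min_support=1, max_len=3):
    """Naive stand-in for sequential pattern mining.

    Keeps the subsequences that occur in at least min_support paths.
    """
    counts = {}
    for path in paths:
        for pattern in subsequences(path, max_len):
            counts[pattern] = counts.get(pattern, 0) + 1
    return {p for p, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Jaccard overlap of two pattern sets (assumed stand-in similarity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_trees(query, candidates, k=1):
    """Rank candidate trees by similarity to the query and return the top k."""
    query_patterns = mine_patterns(query)
    scored = [
        (name, jaccard(query_patterns, mine_patterns(paths)))
        for name, paths in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


similarity_ab = jaccard(mine_patterns(tree_a), mine_patterns(tree_b))
neighbors = nearest_trees(tree_a, {"b": tree_b, "c": tree_c}, k=1)
```

Because the mined patterns include the branch values (for example `income=high`), two trees that split on the same attributes but route different values to different classes share fewer patterns, which is the intuition the abstract gives for why mining the branches helps.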
