A rule induction algorithm for knowledge discovery and classification

Classification and rule induction are key topics in the fields of decision making and knowledge discovery. The objective of this study is to present a new algorithm developed for automatic knowledge acquisition in data mining. The proposed algorithm, named RES-2 (Rule Extraction System), aims to eliminate the pitfalls and disadvantages of the techniques and algorithms currently in use. It uses a direct rule extraction approach rather than a decision tree, inducing general rules from a set of examples. In this study, 15 multiclass datasets of differing properties and sizes, obtained from the University of California, Irvine, have been used. Classification accuracy and rule count have been used to evaluate the proposed method. The method presents an alternative 3-step approach to classifying categorical, binary, and continuous data that takes advantage of data mining algorithms for classification and decision rule generation, and it aims to improve the classification accuracy of algorithms that extract decision rules. Experimental studies were conducted on the benchmark datasets, and comparisons with well-known decision rule generation algorithms show that the proposed method classifies with higher accuracy and generates fewer rules.

Each of the tables below compares RES-2 with other classifiers on a single benchmark dataset; the columns report each algorithm, its average accuracy, the number of rules it generates, and the reference from which those figures are taken. The compared classifiers include CBA, MCAR, Ripper, C5, CN2, Ant-Miner, DE/QDE, CLIP4, RMR, NBTree, and others.

Table 7. Comparison of RES-2 with other classifiers for the Breast Cancer Wisconsin dataset.
Table 8. Comparison of RES-2 with other classifiers for the Breast Cancer dataset.
Table 9. Comparison of RES-2 with other classifiers for the Chess dataset.
Table 10. Comparison of RES-2 with other classifiers for the CRX dataset.
Table 11. Comparison of RES-2 with other classifiers for the Dermatology dataset.
Table 12. Comparison of RES-2 with other classifiers for the Diabetes dataset.
Table 13. Comparison of RES-2 with other classifiers for the Hepatitis dataset.
Table 14. Comparison of RES-2 with other classifiers for the Iris dataset.
Table 15. Comparison of RES-2 with other classifiers for the Lymphography dataset.
Table 16. Comparison of RES-2 with other classifiers for the Mushroom dataset.
Table 17. Comparison of RES-2 with other classifiers for the Nursery dataset.
Table 18. Comparison of RES-2 with other classifiers for the Soybean-Large dataset.
Table 19. Comparison of RES-2 with other classifiers for the Tic-Tac-Toe dataset.
Table 20. Comparison of RES-2 with other classifiers for the Vote dataset.
An examination of these tables shows that the proposed RES-2 algorithm both classifies the test sets with high accuracy and generates a compact set of rules from the given training sets. For instance, RES-2 achieved the highest accuracy on a total of 5 datasets: 86.18% for Breast Cancer, 97.44% for Vote, 94.14% for Soybean-Large, 79.43% for Balance-Scale, and 100% for Mushroom. It achieved the second highest accuracy on the Iris and Chess datasets, at 61% and 98.8%, respectively, and the third highest on the Diabetes, Lymphography, and Dermatology datasets. Its accuracy on the remaining datasets was also high.

Conclusion

In this article, a new algorithm that produces rules directly, using a simple yet effective rule search mechanism, has been presented. By means of this search technique, the algorithm explores all of the cases that could generate rules, without performing any calculations in the rule generation stage, and turns them into rules. The search technique is based on testing whether a candidate attribute–value combination corresponds to exactly one class. In this way it generates all of the rules and ensures that the most general ones are chosen; the rule count is then reduced by eliminating unnecessary and repeated rules at the end of the procedure. Furthermore, because no pruning method is used, the algorithm classifies the training set with 100% accuracy. According to the results obtained on real-life benchmark datasets, the proposed method outperforms many other algorithms in terms of both accuracy and number of rules. In the future, the algorithm could eliminate low-quality rules and generate still fewer rules by applying a pruning procedure and computing a quality measure for the obtained rules. Additional issues for further study include examining how the proposed algorithm can be combined with other classification methods such as support vector machines, ant colony optimization, and Bayesian networks.
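To make the search mechanism described above concrete, the following is a minimal sketch in Python under stated assumptions: attributes are purely categorical, the names (`induce_rules`, `examples`, `attributes`) are illustrative rather than taken from the paper, and the coverage and subsumption checks are one plausible reading of how unnecessary and repeated rules are eliminated.

```python
from itertools import combinations

def induce_rules(examples, attributes):
    """Sketch of direct rule extraction: test attribute-value combinations,
    accepting a combination as a rule only when every training example it
    covers belongs to a single class. Smaller combinations are tried first,
    so the most general rules are accepted before more specific ones.

    examples   -- list of (dict mapping attribute -> value, class label) pairs
    attributes -- list of attribute names (assumed categorical)
    """
    rules = []       # each rule: (frozenset of (attribute, value) conditions, class label)
    covered = set()  # indices of training examples already explained by some rule

    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            # Group examples by the value combination they exhibit on this subset.
            groups = {}
            for idx, (ex, label) in enumerate(examples):
                key = frozenset((a, ex[a]) for a in subset)
                groups.setdefault(key, []).append((idx, label))
            for conditions, matches in groups.items():
                labels = {label for _, label in matches}
                if len(labels) != 1:
                    continue  # combination occurs under more than one class: not a rule
                if all(idx in covered for idx, _ in matches):
                    continue  # everything it covers is already explained: repeated rule
                if any(prior <= conditions for prior, _ in rules):
                    continue  # subsumed by a more general accepted rule: unnecessary
                rules.append((conditions, labels.pop()))
                covered.update(idx for idx, _ in matches)
        if len(covered) == len(examples):
            break  # every training example covered: 100% accuracy on the training set
    return rules

# Toy usage with two categorical attributes and two classes.
data = [({"outlook": "sunny", "windy": "no"},  "play"),
        ({"outlook": "sunny", "windy": "yes"}, "stay"),
        ({"outlook": "rain",  "windy": "no"},  "stay")]
for conds, label in induce_rules(data, ["outlook", "windy"]):
    print("IF", dict(conds), "THEN", label)
```

Trying condition sets in order of increasing size is what guarantees that the most general rules are found first, and, because no pruning is applied, the loop runs until every training example is covered, matching the 100% training accuracy noted above.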
Acknowledgment

All of the datasets were obtained from the University of California at Irvine’s Repository of Machine Learning Databases and Domain Theories, managed by Patrick M. Murphy.

References

  • Ö. Akgöbek, Y.S. Aydın, E. Öztemel, M.S. Aksoy, “A new algorithm for automatic knowledge acquisition in inductive learning”, Knowledge-Based Systems, Vol. 19, pp. 388–395, 2006.
  • K.J. Cios, N. Liu, L.S. Goodenday, “Generation of diagnostic rules via inductive machine learning”, Kybernetes, Vol. 22, pp. 44–56, 1993.
  • D. Fournier, B. Cremilleux, “A quality index for decision tree pruning”, Knowledge-Based Systems, Vol. 15, pp. 37–43, 2002.
  • D.T. Pham, S. Bigot, S.S. Dimov, “A rule merging technique for handling noise in inductive learning”, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, Vol. 218, pp. 1255–1268, 2004.
  • D.T. Pham, S. Bigot, S.S. Dimov, “Rules-5: A rule induction algorithm for classification problems involving continuous attributes”, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, Vol. 217, pp. 1273–1285, 2003.
  • J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, USA, 1993.
  • S. Cohen, L. Rokach, O. Maimon, “Decision-tree instance-space decomposition with grouped gain-ratio”, Information Sciences, Vol. 177, pp. 3592–3612, 2007.
  • H. Su, Y. Yang, L. Zhao, “Classification rule discovery with DE/QDE algorithm”, Expert Systems with Applications, Vol. 37, pp. 1216–1222, 2010.
  • F. Thabtah, P. Cowling, S. Hammoud, “Improving rule sorting, predictive accuracy and training time in associative classification”, Expert Systems with Applications, Vol. 31, pp. 414–426, 2006.
  • F. Coenen, P. Leng, “The effect of threshold values on association rule based classification accuracy”, Data and Knowledge Engineering, Vol. 60, pp. 345–360, 2007.
  • F.A. Thabtah, P.I. Cowling, “A greedy classification algorithm based on association rule”, Applied Soft Computing, Vol. 7, pp. 1102–1111, 2007.
  • K.J. Cios, L.A. Kurgan, “CLIP4: Hybrid inductive machine learning algorithm that generates inequality rules”, Information Sciences, Vol. 163, pp. 37–83, 2004.
  • Ö. Akgöbek, “A hybrid approach for improving the accuracy of classification algorithms in data mining”, Energy Education Science and Technology Part A: Energy Science and Research, Vol. 29, pp. 1039–1054, 2012.
  • M. Rodríguez, D.M. Escalante, A. Peregrín, “Efficient distributed genetic algorithm for rule extraction”, Applied Soft Computing, Vol. 11, pp. 733–743, 2011.
  • S. Dehuri, R. Mall, “Predictive and comprehensible rule discovery using a multiobjective genetic algorithm”, Knowledge-Based Systems, Vol. 19, pp. 413–421, 2006.
  • G. Bologna, “A model for single and multiple knowledge based networks”, Artificial Intelligence in Medicine, Vol. 28, pp. 141–163, 2003.
  • C.C. Bojarczuk, H.S. Lopes, A.A. Freitas, E.L. Michalkiewicz, “A constrained-syntax genetic programming system for discovering classification rules: Application to medical datasets”, Artificial Intelligence in Medicine, Vol. 30, pp. 27–48, 2004.
  • C. Setzkorn, R.C. Paton, “On the use of multi-objective evolutionary algorithms for the induction of fuzzy classification rule systems”, BioSystems, Vol. 81, pp. 101–112, 2005.
  • L.M. Wang, X.L. Li, C.H. Cao, S.M. Yuan, “Combining decision tree and naive Bayes for classification”, Knowledge-Based Systems, Vol. 19, pp. 511–515, 2006.
  • K.C. Tan, E.J. Teoh, Q. Yu, K.C. Goh, “A hybrid evolutionary algorithm for attribute selection in data mining”, Expert Systems with Applications, Vol. 36, pp. 8616–8630, 2009.
  • E. Elalfi, R. Haque, M.E. Elalami, “Extracting rules from trained neural network using GA for managing E-business”, Applied Soft Computing, Vol. 4, pp. 65–77, 2004.
  • H. Kahramanli, N. Allahverdi, “Rule extraction from trained adaptive neural networks using artificial immune systems”, Expert Systems with Applications, Vol. 36, pp. 1513–1522, 2009.
  • K.M. Osei-Bryson, “Post-pruning in decision tree induction using multiple performance measures”, Computers and Operations Research, Vol. 34, pp. 3331–3345, 2007.
  • J.R. Quinlan, “Learning efficient classification procedures and their application to chess end games”, In: R.S. Michalski, J.G. Carbonell, T.M. Mitchell, Eds., Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, Los Altos, CA, USA, pp. 463–482, 1983.
  • R.S. Michalski, “A theory and methodology of inductive learning”, In: R.S. Michalski, J.G. Carbonell, T.M. Mitchell, Eds., Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, Los Altos, CA, USA, pp. 83–134, 1983.
  • U.M. Fayyad, K.B. Irani, “Multi-interval discretization of continuous-valued attributes”, 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027, 1993.
  • L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, USA, 1984.
  • C.H. Lee, “A Hellinger-based discretization method for numeric attributes in classification learning”, Knowledge-Based Systems, Vol. 20, pp. 419–425, 2007.
  • K.A. Toh, “An error-counting network for pattern classification”, Neurocomputing, Vol. 71, pp. 1680–1693, 2008.
  • M.M. Oprea, “Inductive learning applied to knowledge acquisition for an expert system”, 35th Year of Petroleum-Gas University Activity, 2002.
  • C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases (Machine Readable Data Repository), Department of Information and Computer Science, University of California, Irvine, 1999. Available at http://archive.ics.uci.edu/ml/datasets.html.
  • D.K. Kang, K. Sohn, “Learning decision trees with taxonomy of propositionalized attributes”, Pattern Recognition, Vol. 42, pp. 84–92, 2009.
  • D. Martens, B. Baesens, T.V. Gestel, J. Vanthienen, “Comprehensible credit scoring models using rule extraction from support vector machines”, European Journal of Operational Research, Vol. 183, pp. 1466–1476, 2007.
  • D.T. Pham, A.A. Afify, “RULES-6: A simple rule induction algorithm for supporting decision making”, 31st Annual Conference of IEEE Industrial Electronics Society, pp. 2184–2189, 2005.
  • J.D. Nielsen, R. Rumí, A. Salmerón, “Supervised classification using probabilistic decision graphs”, Computational Statistics and Data Analysis, Vol. 53, pp. 1299–1311, 2009.
  • D.R. Carvalho, A.A. Freitas, “A hybrid decision tree/genetic algorithm for data mining”, Information Sciences, Vol. 163, pp. 13–35, 2004.
  • G. Chen, H. Liu, L. Yu, Q. Wei, X. Zhang, “A new approach to classification based on association rule mining”, Decision Support Systems, Vol. 42, pp. 674–689, 2006.
  • P. Luukka, “Similarity classifier using similarity measure derived from Yu’s norms in classification of medical data sets”, Computers in Biology and Medicine, Vol. 37, pp. 1133–1140, 2007.