Comparison of Discretization Methods for Decision Tree and Decision Rule Classifiers on Medical Data Sets

Real-life data sets are stored in databases as real numbers. On the other hand, many data mining methods, such as association rules and rule induction, can handle only discrete attributes. For this reason, data sets with continuous attributes must be converted into data sets with discrete attributes. Discretization reduces the number of values of a given continuous attribute by dividing its range into intervals. In this paper, eight discretization methods are analyzed together with the rule- and tree-based classifier algorithms JRip, OneR, J48, and PART. Experiments use ten-fold cross-validation on real-life data sets from the UCI repository. We show that discretization is an important step that significantly increases the classification performance of these algorithms. Finally, the study found that MDL with J48, CAIM with JRip, and Extended Chi with J48 gave the highest accuracy on the PIMA, WBC, and DERMA data sets, respectively.
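The interval-splitting idea described above can be sketched in a few lines. The example below is a minimal illustration of *unsupervised* equal-width binning only, not the supervised MDL, CAIM, or Chi2-family methods evaluated in the paper; the function name and the sample attribute values are hypothetical.

```python
# Minimal sketch: unsupervised equal-width discretization.
# Supervised methods (e.g., MDL, CAIM) instead choose cut points using
# the class labels; this only illustrates mapping a continuous
# attribute onto a small number of intervals.

def equal_width_discretize(values, n_bins):
    """Map each continuous value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant attribute
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        bins.append(min(idx, n_bins - 1))  # clamp the maximum value into the last bin
    return bins

# Example: a hypothetical glucose-like attribute split into 3 intervals.
glucose = [85.0, 90.0, 120.0, 150.0, 180.0, 199.0]
print(equal_width_discretize(glucose, 3))  # -> [0, 0, 0, 1, 2, 2]
```

In a WEKA-style pipeline, a filter of this kind would be applied to every continuous attribute before the rule or tree learner is trained.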

___

  • Abraham, R., Simha, J. B., & Iyengar, S. S. (2009). Effective Discretization and Hybrid feature selection using Naïve Bayesian classifier for Medical datamining. International Journal of Computational Intelligence Research, 5(2), 116–129.
  • Chmielewski, M. R., & Grzymala-Busse, J. W. (1996). Global discretization of continuous attributes as preprocessing for machine learning. International Journal of Approximate Reasoning, 15(4), 319–331.
  • Cohen, W. W. (1995). Fast effective rule induction. In Machine learning proceedings 1995 (pp. 115–123). Elsevier.
  • Das, K., & Vyas, O. P. (2010). A suitability study of discretization methods for associative classifiers. International Journal of Computer Applications, 5(10), 0975–8887.
  • Dermatology data set. (n.d.). Available from https://archive.ics.uci.edu/ml/datasets/Dermatology.
  • Dua, D., & Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence.
  • Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1), 87–102.
  • Ferreira, A. J., & Figueiredo, M. A. (2012). An unsupervised approach to feature discretization and selection. Pattern Recognition, 45(9), 3048–3060.
  • Frank, E., & Witten, I. H. (1998). Generating accurate rule sets without global optimization. In Proceedings of the Fifteenth International Conference on Machine Learning.
  • Garcia, S., Luengo, J., Sáez, J. A., Lopez, V., & Herrera, F. (2012). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), 734–750.
  • Gonzalez-Abril, L., Cuberos, F. J., Velasco, F., & Ortega, J. A. (2009). Ameva: An autonomous discretization algorithm. Expert Systems with Applications, 36(3), 5327–5332.
  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
  • Hishamuddin, M. N. F., Hassan, M. F., & Mokhtar, A. A. (2020). Improving Classification Accuracy of Random Forest Algorithm Using Unsupervised Discretization with Fuzzy Partition and Fuzzy Set Intervals. Proceedings of the 2020 9th International Conference on Software and Computer Applications, 99–104.
  • Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1), 63–90.
  • Jin, R., Breitbart, Y., & Muoh, C. (2009). Data discretization unification. Knowledge and Information Systems, 19(1), 1–29.
  • Jun, S. (2021). Evolutionary Algorithm for Improving Decision Tree with Global Discretization in Manufacturing. Sensors, 21(8), 2849.
  • Kerber, R. (1992). Chimerge: Discretization of numeric attributes. Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.
  • Kotsiantis, S., & Kanellopoulos, D. (2006). Discretization techniques: A recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1), 47–58.
  • Kurgan, L. A., & Cios, K. J. (2004). CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2), 145–153.
  • Li, Y., Liu, L., Bai, X., Cai, H., Ji, W., Guo, D., & Zhu, Y. (2010). Comparative study of discretization methods of microarray data for inferring transcriptional regulatory networks. BMC Bioinformatics, 11(1), 1–6.
  • Liu, H., & Setiono, R. (1995). Chi2: Feature selection and discretization of numeric attributes. Proceedings of 7th IEEE International Conference on Tools with Artificial Intelligence, 388–391.
  • Menéndez, L. Á., de Cos Juez, F. J., Lasheras, F. S., & Riesgo, J. Á. (2010). Artificial neural networks applied to cancer detection in a breast screening programme. Mathematical and Computer Modelling, 52(7–8), 983–991.
  • Nguyen, H. S. (1998). Discretization methods in data mining. Rough Sets in Knowledge Discovery, 451–482.
  • Pima Indians Diabetes data set. (n.d.). Available from https://archive.ics.uci.edu/ml/datasets/diabetes.
  • Quinlan, J. R. (2014). C4.5: Programs for machine learning. Elsevier.
  • Rajput, A., Aharwal, R. P., Dubey, M., Saxena, S., & Raghuvanshi, M. (2011). J48 and JRIP rules for e-governance data. International Journal of Computer Science and Security (IJCSS), 5(2), 201.
  • Su, C.-T., & Hsu, J.-H. (2005). An extended chi2 algorithm for discretization of real value attributes. IEEE Transactions on Knowledge and Data Engineering, 17(3), 437–441.
  • Tran, B., Xue, B., & Zhang, M. (2017). A new representation in PSO for discretization-based feature selection. IEEE Transactions on Cybernetics, 48(6), 1733–1746.
  • Tsai, C.-F., & Chen, Y.-C. (2019). The optimal combination of feature selection and data discretization: An empirical study. Information Sciences, 505, 282–293.
  • Tsai, C.-J., Lee, C.-I., & Yang, W.-P. (2008). A discretization algorithm based on class-attribute contingency coefficient. Information Sciences, 178(3), 714–731.
  • Wolberg, W. H., & Mangasarian, O. (1992). Breast Cancer Wisconsin (Original) data set. UCI Machine Learning Repository.
  • Xu, X. (2006). Adaptive intrusion detection based on machine learning: Feature extraction, classifier construction and sequential pattern prediction. International Journal of Web Services Practices, 2(1–2), 49–58.