Unsupervised Discretization of Continuous Variables in a Chicken Egg Quality Traits Dataset
Discretization is a data pre-processing task that transforms continuous variables into discrete ones so that data mining algorithms such as association rule extraction and classification trees can be applied. In this study we empirically compared the performance of the equal width intervals (EWI), equal frequency intervals (EFI) and K-means clustering (KMC) methods for discretizing 14 continuous variables in a chicken egg quality traits dataset. We found that these unsupervised discretization methods can decrease the training error rates and increase the test accuracies of classification tree models. Comparing the training errors and test accuracies of models built with the C5.0 classification tree algorithm, we also found that the EWI, EFI and KMC methods produced broadly similar results. Among the rules used for estimating the number of intervals, the Rice rule gave the best result with EWI but not with EFI. It was also found that the Freedman-Diaconis rule with EFI, and the Doane rule with both EFI and EWI, performed slightly better than the other rules.
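The three discretization methods and several of the interval-count rules named above can be sketched as follows. This is an illustrative Python sketch, not the authors' code (the study itself used R packages such as C50 and arules); the function names and the crude quartile estimate in the Freedman-Diaconis rule are simplifications of my own.

```python
import math
import random

# --- Rules for estimating the number of intervals k (from the histogram literature) ---

def sturges_k(n):
    # Sturges (1926): k = ceil(log2(n)) + 1
    return math.ceil(math.log2(n)) + 1

def rice_k(n):
    # Rice rule: k = ceil(2 * n^(1/3))
    return math.ceil(2 * n ** (1 / 3))

def doane_k(x):
    # Doane (1976): adjusts Sturges' rule for the sample skewness g1.
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / n
    g1 = (sum((v - m) ** 3 for v in x) / n) / s2 ** 1.5
    sg1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return round(1 + math.log2(n) + math.log2(1 + abs(g1) / sg1))

def fd_k(x):
    # Freedman-Diaconis (1981): bin width h = 2*IQR*n^(-1/3); k = ceil(range / h).
    # Quartiles are estimated crudely from the sorted sample.
    xs, n = sorted(x), len(x)
    h = 2 * (xs[3 * n // 4] - xs[n // 4]) * n ** (-1 / 3)
    return math.ceil((xs[-1] - xs[0]) / h)

# --- The three unsupervised discretization methods ---

def ewi(x, k):
    # Equal width intervals: k bins of equal width spanning [min, max].
    lo, w = min(x), (max(x) - min(x)) / k
    return [min(int((v - lo) / w), k - 1) for v in x]

def efi(x, k):
    # Equal frequency intervals: each bin holds about n/k observations.
    order = sorted(range(len(x)), key=lambda i: x[i])
    labels = [0] * len(x)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(x)
    return labels

def kmc(x, k, iters=50):
    # 1-D K-means clustering: bins are clusters around k centroids.
    cents = sorted(random.sample(list(x), k))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in x:
            groups[min(range(k), key=lambda j: abs(v - cents[j]))].append(v)
        cents = [sum(g) / len(g) if g else cents[j]
                 for j, g in enumerate(groups)]
    return [min(range(k), key=lambda j: abs(v - cents[j])) for v in x]
```

For example, `ewi(weights, rice_k(len(weights)))` would discretize a hypothetical list of egg weights into the number of bins suggested by the Rice rule; each function returns one integer bin label per observation.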
References
- Biba M, Esposito F, Ferilli S, Mauro ND, Basile TMA. 2007. Unsupervised discretization using kernel density estimation. Proc. of the 20th Int. Conf. on AI, Hyderabad, India, p. 696-701.
- Brooks CEP, Carruthers N. 1953. Handbook of statistical methods in meteorology. H M Stationery Office, London.
- Cantú-Paz E. 2001. Supervised and unsupervised discretization methods for evolutionary algorithms. In Proc. of the Genetic and Evolutionary Computation Conference (GECCO-2001), p. 213-216.
- Cebeci Z, Yıldız F, Kayaalp GT. 2015. K-ortalamalar kümelemesinde optimum K değeri seçilmesi [Selection of the optimum K value in K-means clustering]. 2. Ulusal Yönetim Bilişim Sistemleri Kongresi [2nd National Management Information Systems Congress], Erzurum, 8-10 October 2015. Proceedings (Ed: Ü. Özen et al.), p. 231-242.
- Cencov NN. 1962. Evaluation of an unknown distribution density from observations. Soviet Mathematics, 3: 1559–1562.
- Doane DP. 1976. Aesthetic frequency classification. American Statistician, 30(4): 181-183.
- Dash R, Paramguru RL, Dash R. 2011. Comparative analysis of supervised and unsupervised discretization techniques. Int. J. of Advances in Science and Technology, 2(3): 29-37.
- Davies OL, Goldsmith PL. 1980. Statistical methods in research and production. 4th edn. Longman, London, p. 478.
- Doran JE, Hodson FR. 1975. Mathematics and computers in archaeology. Cambridge, Massachusetts: Harvard University Press, p. 381.
- Dougherty J, Kohavi R, Sahami M. 1995. Supervised and unsupervised discretization of continuous features. In Proc. of the 12th Int. Conf. on Machine Learning, p. 194-202.
- Freedman D, Diaconis P. 1981. On the histogram as a density estimator: L2 theory. Zeit. Wahr. ver. Geb. 57(4): 453–476.
- García S, Luengo J, Sáez J A, López V, Herrera F. 2013. A survey of discretization techniques, taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4): 734-750.
- Hahsler M, Buchta C, Gruen B, Hornik K. 2016. arules: mining association rules and frequent itemsets. R package version 1.4-1. https://CRAN.R-project.org/package=arules [Accessed on 26.07.2016].
- Hemada B, Lakshmi KSV. 2013. A study on discretization techniques. Int. J. of Engineering Research & Technology, 2(8): 1887-1892.
- Huntsberger DV. 1962. Elements of statistical inference. London: Prentice-Hall.
- Hyndman RJ. 1995. The problem with Sturges’ rule for constructing histograms. URL: http://robjhyndman.com/papers/sturges.pdf [Accessed on 26.07.2016].
- Kotsiantis S, Kanellopoulos D. 2006. Discretization techniques: a recent survey. GESTS International Transactions on Computer Science and Engineering, 32 (1): 47-58.
- Kuhn M, Weston S, Coulter N, Culp M. 2016. C50: C5.0 decision trees and rule-based models. R package version 0.1.0-24 (C code for C5.0 by R. Quinlan, License: GPL-3). https://cran.r-project.org/web/packages/C50/ [Accessed on 26.07.2016].
- Lane DM, Scott D, Hebl M, Guerra R, Osherson D, Zimmer H. 2016. Introduction to statistics: a multimedia course of study. http://onlinestatbook.com/ [Accessed on 26.07.2016].
- Liu H, Hussain F, Tan C L, Dash M. 2002. Discretization: an enabling technique. Data Mining and Knowledge Discovery, 6(4): 393-423.
- Muhlenbach F, Rakotomalala R. 2005. Discretization of continuous attributes. In: Encyclopedia of Data Warehousing and Mining (Ed. J. Wang), p. 397-402.
- Pham DT, Dimov SS, Nguyen CD. 2005. Selection of K in K-means clustering. Journal of Mechanical Engineering Science, 219: 103-119.
- R Development Core Team. 2016. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/
- Ramírez-Gallego S, García S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Alonso-Betanzos A, Benítez JM, Herrera F. 2015. Data discretization: taxonomy and big data challenge. WIREs Data Mining and Knowledge Discovery, 6(1): 5-21.
- Rodriguez G. 2016. kselection: selection of K in K-means clustering. R package version 0.2.0. http://CRAN.R-project.org/package=kselection [Accessed on 26.07.2016].
- Scott DW. 1979. On optimal and data-based histograms. Biometrika, 66(3): 605–610.
- Scott DW. 1992. Multivariate density estimation: theory, practice and visualization. New York: John Wiley & Sons.
- Sturges HA. 1926. The choice of a class-interval. J. Amer. Statist. Assoc., 21(153): 65–66.
- Terrell GR, Scott DW. 1985. Oversmoothed nonparametric density estimates. Journal of the American Statistical Association, 80(389): 209–214.