Ayşe ÇINAR

PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS IN DATA MINING AND AN APPLICATION WITH THE R LANGUAGE

The classification method is one of the principal methods of data mining and is based on a learning algorithm. It is applied to discover patterns hidden in large-scale data. In data mining, a pattern is defined as observable, measurable and repeatable information recorded in digital form about an entity. Classification algorithms, applied to obtain the desired information, divide (discretize) a dataset into specific classes according to the common features of the data it contains. This process yields a classification model. The resulting model is then applied to a new dataset to search it for instances similar to the classes the model has defined. This process is called “pattern recognition”. In this study, the classification process in data mining is examined and an application is carried out with two different classification algorithms, C5.0 and Gini. To this end, the open-source R language was used to obtain performance measures for the accuracy of the predictions of both classification models. In addition, the model with the best performance measure was selected and its results were evaluated.

Keywords:

Classification Method, Classification Algorithms, R Language, Gini Algorithm, C5.0 Algorithm, Confusion Matrix, Performance Evaluation

PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS IN DATA MINING AND AN APPLICATION WITH THE R LANGUAGE

Knowledge discovery in databases (KDD) is the overall process of uncovering previously unknown and useful knowledge in large volumes of data. The first stage of KDD is the ETL (extract, transform, load) process, which consists of the following sequential steps: extracting raw data from a data source, preprocessing it, and loading the processed data into data repositories such as databases and data warehouses. Data preprocessing converts raw data into a clean dataset suited to the purpose of the project at hand. Data mining is an important part of the knowledge discovery process. In contrast to traditional analysis techniques, data mining is a process for extracting understandable, valuable and previously unknown information from large datasets. Data mining techniques fall into two categories: supervised learning and unsupervised learning. Supervised learning is a type of machine learning in which a classification model, called the training model, is built from data with a reference, that is, with known class labels. The resulting model is then used to predict the class of the testing data. Supervised learning techniques include Classification, Decision Tree, Bayesian Classification, Neural Networks and Association Rule Mining. Unsupervised learning is also a type of machine learning; it differs from supervised learning in that it learns from the data without a reference. Therefore, it is not necessary to build a prior model in unsupervised learning. Clustering is one of the unsupervised learning techniques. It separates data into groups called clusters, within which objects are similar to each other.
Several data mining techniques have been developed for knowledge discovery from large datasets, including Classification, Clustering, Decision Tree, Bayesian Classification, Neural Networks, Association Rule Mining, Prediction, Sequential Pattern, Genetic Algorithm, Time Series and Nearest Neighbor. The classification method, one of the main methods of data mining, is based on a learning algorithm. It is applied to discover hidden patterns in large-scale data. Following the ETL process, a classification model is created by selecting one of the data mining methods. Within the scope of data mining, a pattern is defined as observable, measurable and repeatable information about an entity that is stored in digital form. Classification algorithms, applied to obtain the target information, separate a dataset into several classes according to the common features of the data. This process yields a classification model. The model is then applied to a new dataset to look for examples similar to the classes it has determined. This process is called “pattern recognition”. To build predictive models, the dataset is divided into two sets, the training and testing datasets. The aim of the study is to apply classification algorithms to a dataset and to evaluate the performance of the resulting models in terms of prediction accuracy. For this purpose, a database named “Data_User_Modeling_Dataset_Hamdi_Tolga_KAHRAMAN.xls” was chosen as the sample case. It contains raw data about the knowledge level of learners in e-learning systems and can be downloaded as a dataset from the “UCI Machine Learning” website. In the study, an application covering the classification process in data mining was carried out with two different classification algorithms, C5.0 and Gini.
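The abstract does not spell out how the Gini algorithm chooses its splits. Assuming it refers to the usual decision-tree criterion of Gini impurity, the idea can be sketched as follows. The sketch is written in Python rather than R only to keep it free of package dependencies; the function names, the toy attribute values and the threshold are hypothetical and are not taken from the study.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two groups produced by
    splitting a numeric attribute at `threshold` (<= goes left)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + \
           (len(right) / n) * gini_impurity(right)


# Hypothetical toy data: one score-like attribute and a class label.
scores = [0.10, 0.20, 0.30, 0.60, 0.70, 0.90]
levels = ["Low", "Low", "Low", "High", "High", "High"]

parent_impurity = gini_impurity(levels)                    # before splitting
clean_split = split_gini(scores, levels, threshold=0.45)   # separates the classes
poor_split = split_gini(scores, levels, threshold=0.25)    # leaves one group mixed
```

A tree learner of this kind would prefer the threshold with the lowest weighted impurity, here the one that separates the two classes completely.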
Additionally, to build the predictive models, the dataset was divided at a predetermined ratio into two sets, the training and testing datasets. The open-source R programming language was then used to build a classification model with each of the two algorithms. Running the R code on the training dataset produced a set of decision rules and a decision tree for each algorithm. After the class of each observation in the testing dataset was predicted, performance measures for the accuracy of both models' predictions were computed by comparing the predicted classes with the actual class of each observation in the testing dataset. The model with the best performance was then selected and its results were evaluated. The results of the selected classification model showed that the attribute describing the learners' exam performance for goal objects (PEG) was the most decisive predictor of their knowledge levels. The attribute describing the learners' exam performance for objects related to the goal object (LPR) took second place in order of importance.
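The evaluation procedure described above (a hold-out split followed by a comparison of predicted and actual classes) can be illustrated with a minimal sketch. It is written in Python rather than R to stay dependency-free, and it is not the study's code: the 70/30 ratio, the seed, the helper names and the label lists are all assumptions made for illustration, since the abstract gives neither the split ratio nor the predictions.

```python
import random
from collections import Counter


def train_test_split(rows, train_fraction, seed=0):
    """Shuffle a copy of `rows` and cut it into training and testing parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(actual, predicted):
    """Count each (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of observations whose predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical hold-out split of 10 observation indices at an assumed 70/30 ratio.
train_rows, test_rows = train_test_split(list(range(10)), train_fraction=0.7)

# Hypothetical actual and predicted classes for five test observations.
actual = ["High", "Low", "Middle", "Low", "High"]
predicted = ["High", "Low", "Low", "Low", "High"]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
```

Here `matrix[("Middle", "Low")]` counts the observations whose actual class is "Middle" but which were predicted as "Low"; the diagonal entries, where both labels agree, are the ones that contribute to accuracy.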

Keywords:

Classification Method, Classification Algorithms, R Language, Gini Algorithm, C5.0 Algorithm, Confusion Matrix, Performance Evaluation


___

Adak, M. F. & Yurtay, N. (2013). Gini Algoritmasını Kullanarak Karar Ağacı Oluşturmayı Sağlayan Bir Yazılımın Geliştirilmesi, Bilişim Teknolojileri Dergisi, Vol. 6, No. 3, 1-6.
Balaban, M. E. & Kartal, E. (2016). Veri Madenciliği ve Makine Öğrenmesi Temel Algoritmaları ve R Dili ile Uygulamaları, Çağlayan Yayınevi, İstanbul.
Cunningham, P., Cord, M. & Delany, S. J. (2008). Supervised Learning, Machine Learning Techniques for Multimedia, Chapter 2, Springer, 21-49.
Han, J. & Kamber, M. (2012). Data Mining: Concepts and Techniques, Elsevier Inc., Third Edition, USA.
Hastie, T., Tibshirani, R. & Friedman, J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer.
Kahraman, H. T., Sağıroğlu, S. & Çolak, I. (2013). Developing Intuitive Knowledge Classifier And Modeling Of Users' Domain Dependent Data In Web, Knowledge Based Systems, vol. 37, 283-295.
Kantardzic, M. (2011). Data Mining Concepts, Models, Methods, and Algorithms, A John Wiley & Sons, Inc., Second Edition, USA.
Kohavi, R. & Provost, F. (1998). Glossary of Terms Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Kluwer Academic Publishers, Boston, http://robotics.stanford.edu/~ronnyk/glossary.html (05.11.2017).
Kuhn, M. (2017). Classification and Regression Training, Package ‘caret’ Version 6.0-77, https://cran.r-project.org/web/packages/caret/caret.pdf (05.11.2017).
Kuhn, M., Weston, S., Coulter, N., Culp, M. & Quinlan, R. (2018). Decision Trees and Rule-Based Models, Package ‘C50’, https://cran.r-project.org/web/packages/C50/C50.pdf (05.10.2018).
Kumar, S. V. K. & Kiruthika, P. (2015). An Overview of Classification Algorithm in Data mining, International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), Vol. 4, Issue 12, 255-257, https://www.ijarcce.com/upload/2015/december-15/IJARCCE%2059.pdf (19.02.2018).
Markham, K. (2014). Simple Guide To Confusion Matrix Terminology, http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/ (05.11.2017).
Özkan, Y. & Erol, Ç. S. (2015). Biyoenformatik DNA Mikrodizi Veri Madenciliği, Papatya Yayıncılık, İstanbul.
Özkan, Y. (2016). Veri Madenciliği Yöntemleri, Papatya Yayıncılık Eğitim A.Ş., Third Edition, İstanbul.
Pandya, R. & Pandya, J. (2015). C5.0 Algorithm to Improved Decision Tree with Feature Selection and Reduced Error Pruning, International Journal of Computer Applications (0975 – 8887), Volume 117 – No. 16, 18-21, http://research.ijcaonline.org/volume117/number16/pxc3903318.pdf (19.02.2018).
Sokolova, M. & Lapalme, G. (2009). A Systematic Analysis of Performance Measures For Classification Tasks, Information Processing and Management, 45 (2009) 427–437, Elsevier Inc., http://atour.iro.umontreal.ca/rali/sites/default/files/publis/SokolovaLapalme-JIPM09.pdf (19.02.2018).
Therneau, T., Atkinson, B. & Ripley, B. (2017). Recursive Partitioning and Regression Trees, Package ‘rpart’, https://cran.r-project.org/web/packages/rpart/rpart.pdf (05.11.2017).
UCI (2009). https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling (05.11.2017).