COMPARISON OF CLASSIFICATION RESULTS OF SMO AND J48 ALGORITHMS ON DIFFERENT DATA SETS

Purpose: Data mining is an interdisciplinary field that is continually developing and whose areas of application are expanding. Through the use of various techniques and algorithms, it helps ensure the reliability of data. Classification is an important data mining technique because it is widely used by researchers. Method: In this study, the classification results of the SMO and J48 algorithms were compared on three different student data sets. The performance of the J48 and SMO algorithms in terms of classification accuracy was evaluated on the three data sets using several accuracy measures, including TP rate, FP rate, precision, recall, F-measure and ROC analysis. Findings and Conclusion: The tests showed that the SMO algorithm achieved better classification performance on all three data sets.

COMPARISON OF CLASSIFICATION RESULTS OF SMO AND J48 ALGORITHMS ON DIFFERENT DATA SETS

Institutional data sources, social media posts, website articles and forms generate large amounts of data. Processing such volumes by traditional means and turning them into information that supports decision-making is very difficult. In this context, data mining offers advanced techniques for producing the required information from the available data.
Databases are rich in hidden information that can support rational decision-making. Classification and prediction are two important data analysis techniques, used to describe important data classes and to predict future data trends. These analyses help in understanding large amounts of data. Today, institutions produce large amounts of data but have difficulty extracting meaningful and useful information from it. Large data sets are not easy to analyze with traditional statistical methods, so special methods are required to process and analyze them. Data mining methods emerged to meet this need.
The aim of this study is to compare the performance of the SMO and J48 algorithms, which are used for classification in data mining. For this purpose, data mining was performed on three different student data sets. Data mining is an analysis method that uncovers hidden relationships in data and summarizes the data in novel ways that are both useful and understandable. It is one step in the process of knowledge discovery in databases, which explores scientific and technical data to reveal previously unknown patterns. Classification is a process that occurs frequently in daily life. It divides objects so that each is assigned to one of a number of mutually exclusive and exhaustive categories, known as classes. Many practical decision-making tasks can be formulated as classification problems; for example, people or objects can be assigned to one of several categories.
Classification is the process of assigning elements to different classes. The classes may be defined by business rules, class boundaries or mathematical functions. A classification can be based on the relationship between known class assignments and the properties of the elements being classified. This type of classification is called “supervised learning”. If no examples with known classes are available, the classification is unsupervised. The most common unsupervised classification approach is clustering. The most common applications of clustering technology are retail basket analysis and fraud detection.
In data mining, supervised learning means learning a classification function, or constructing a classification model, from data whose classes are already known. This function or model maps records in the database to the target attribute, so it can be used to predict the class of new data. Depending on the types of data to be mined or the specific data mining application, a data mining system draws on areas such as spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, web technology, economics, business, bioinformatics or psychology.
SMO (Sequential Minimal Optimization) is a simple algorithm that can quickly solve the quadratic programming (QP) problem of a support vector machine (SVM) without any extra matrix storage and without using numerical QP optimization steps. At every step, SMO chooses to solve the smallest possible optimization problem. For the standard SVM QP problem, the smallest possible optimization problem involves two Lagrange multipliers, because the Lagrange multipliers must obey a linear equality constraint. At each step, SMO selects two Lagrange multipliers to optimize jointly, finds the optimal values for these multipliers and updates the SVM to reflect the new optimal values.
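The pairwise step described above can be illustrated with a minimal Python sketch. It follows the commonly published form of the two-multiplier update and is not code from this study; the function name `smo_pair_update` and all argument names are assumptions made for illustration. Here `e1` and `e2` stand for the current prediction errors (SVM output minus label) on the two samples, `k11`, `k22` and `k12` for the kernel values between them, and `c` for the upper bound on each multiplier.

```python
def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k22, k12, c):
    """Analytically optimize one pair of Lagrange multipliers (illustrative sketch).

    a1, a2: current multipliers; y1, y2: labels in {-1, +1};
    e1, e2: prediction errors; k11, k22, k12: kernel values; c: box bound.
    Returns the updated pair (a1_new, a2_new).
    """
    # Ends of the segment on which a2 may move while both multipliers
    # stay in [0, c] and the linear equality constraint still holds.
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)

    # Curvature of the objective along that segment.
    eta = k11 + k22 - 2.0 * k12
    if eta <= 0.0 or low >= high:
        return a1, a2  # degenerate pair: this sketch simply skips it

    # Unconstrained optimum for a2, then clip it to the segment.
    a2_new = a2 + y2 * (e1 - e2) / eta
    a2_new = min(high, max(low, a2_new))

    # Move a1 by the opposite signed amount so y1*a1 + y2*a2 is unchanged.
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new


# Hypothetical values, chosen only to exercise the function.
new_a1, new_a2 = smo_pair_update(0.2, 0.5, 1, -1, 0.3, -0.4, 2.0, 1.0, 0.5, 1.0)
print(new_a1, new_a2)
```

Because the second multiplier is clipped to the allowed segment and the first is adjusted by the opposite signed amount, the sum y1*a1 + y2*a2 keeps its value, which is the linear equality constraint the text refers to. Choosing which pair to optimize and updating the SVM threshold are left out of the sketch.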
The advantage of SMO lies in the fact that solving for two Lagrange multipliers can be done analytically, so numerical QP optimization is avoided entirely. Although more optimization sub-problems are solved over the course of the algorithm, each sub-problem is so fast that the overall QP problem is solved quickly. Furthermore, SMO does not require any additional matrix storage. Therefore, very large SVM training problems can fit into the memory of an ordinary personal computer or workstation. SMO is also less susceptible to numerical precision problems because it uses no matrix algorithms.
J48 is a decision tree algorithm based on the very popular C4.5 algorithm developed by J. Ross Quinlan. Decision trees are a classic way of representing the information learned by a machine learning algorithm and provide a powerful and fast way to express structure in data. The algorithm classifies the data recursively. This maximizes accuracy on the training data, but it can produce overly specific rules that describe only the particular characteristics of that data. Based on information gain theory, the J48 algorithm can automatically select the relevant attributes from the data. It is an iterative algorithm that splits the samples at the point where the information gain is highest. The tree is built from the top down, starting with the selection of the best root variable and the division of the samples. J48 can also prune effectively, cutting weak branches that are not meaningful. One reason for pruning is that the purpose of a decision tree is not to explore the data but to build a simple classification model on it.
In this study, three different data sets of university students were used. The data were cleaned and arranged using Excel macros, and data warehouses were prepared. After the necessary conversions, the data were written to the text files “iibf1.arff”, “iibf2.arff” and “myo.arff”.
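The information-gain criterion that the text attributes to J48 can be sketched in Python as follows. This is an illustrative sketch, not WEKA's implementation: the function names, the attribute names `gender` and `income`, and the four sample records are made up for the example, and C4.5 itself normalizes the gain into a gain ratio, which is omitted here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one nominal attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain (candidate split node)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical records: 'income' separates the classes, 'gender' does not.
rows = [
    {"gender": "F", "income": "low"},
    {"gender": "M", "income": "low"},
    {"gender": "F", "income": "high"},
    {"gender": "M", "income": "high"},
]
labels = ["low_score", "low_score", "high_score", "high_score"]

print(best_attribute(rows, labels, ["gender", "income"]))
```

A full tree builder would apply `best_attribute` recursively to each subset and then prune; only the split selection is shown here.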
The study used version 3.7.2 of WEKA (Waikato Environment for Knowledge Analysis), developed by the University of Waikato. For each data set, the student's gender, province, family income level, number of siblings, number of siblings in education and entrance score were taken as attributes. The level of the entrance score was used to define the classes. According to the results, the SMO algorithm achieved a higher classification success rate than the J48 algorithm, which makes it the more reliable of the two.
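The accuracy measures named in the abstract can be derived from a confusion matrix. The Python sketch below shows the usual definitions for a single class treated as positive; the function name `binary_metrics` and the counts in the example are hypothetical and are not results from this study. ROC analysis needs ranked scores rather than counts and is left out.

```python
def binary_metrics(tp, fp, fn, tn):
    """Per-class accuracy measures from confusion-matrix counts.

    tp: positives classified as positive; fp: negatives classified as positive;
    fn: positives classified as negative; tn: negatives classified as negative.
    """
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0      # also called recall
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp_rate
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for one class, for illustration only.
example = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

For a data set with more than two classes, the same function can be applied once per class, treating that class as positive and all others as negative.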
