Comparative Analysis of Machine Learning Algorithms Based on Variable Importance Evaluation

One of the main goals in machine learning studies is to determine the most significant variables for a specific research problem, and various algorithms have been developed to achieve this goal. Random Forest, Cubist, and MARS are among the most common of these algorithms. Although classical statistical methods are useful to a certain extent for obtaining the importance levels of the variables affecting an output, machine learning algorithms can provide clearer and more precise results. In this study, the prediction results of the Random Forest, Cubist, and MARS algorithms are presented comparatively on a real data set in terms of performance criteria such as mean squared error, the coefficient of determination, and mean absolute error. The results show that the performances of Random Forest and Cubist are similar to each other and better than that of MARS. Additionally, the ranking of the most important variables varies with the type of algorithm. The concordance between the algorithms was examined from a statistical perspective and found satisfactory. Consequently, Random Forest, Cubist, and MARS can be considered effective and useful algorithms for both prediction performance and variable importance evaluation.
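
The kind of comparison the abstract describes can be sketched in R with the caret package, which provides a unified interface to all three algorithms. The sketch below is illustrative, not the authors' code: the data frame `df`, its numeric response column `price`, the split ratio, and the seed are assumptions, not details taken from the study.

```r
# A minimal sketch of the comparison described above; not the authors' code.
# Assumes a data frame `df` with a numeric response column `price` --
# both names are hypothetical placeholders for the study's real data set.
library(caret)  # unified interface to randomForest, Cubist, and earth (MARS)

set.seed(123)
idx      <- createDataPartition(df$price, p = 0.8, list = FALSE)
train_df <- df[idx, ]
test_df  <- df[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# Fit the three models compared in the study
fit_rf     <- train(price ~ ., data = train_df, method = "rf",     trControl = ctrl)
fit_cubist <- train(price ~ ., data = train_df, method = "cubist", trControl = ctrl)
fit_mars   <- train(price ~ ., data = train_df, method = "earth",  trControl = ctrl)

# Performance criteria on the held-out data: RMSE, R-squared, MAE
sapply(list(RF = fit_rf, Cubist = fit_cubist, MARS = fit_mars),
       function(fit) postResample(predict(fit, test_df), test_df$price))

# Variable importance, rescaled to 0-100 by caret
imp_rf     <- varImp(fit_rf)$importance
imp_cubist <- varImp(fit_cubist)$importance

# Concordance of two importance rankings via Spearman's rank correlation
common <- intersect(rownames(imp_rf), rownames(imp_cubist))
cor(imp_rf[common, "Overall"], imp_cubist[common, "Overall"],
    method = "spearman")
```

Here `postResample()` returns RMSE, R², and MAE in one call, and `varImp()` rescales each algorithm's native importance measure to a common 0-100 scale, which makes the rankings directly comparable across models.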
