Examining and improving the information gain in decision trees of the metrics used in software defect prediction

McCabe and Halstead method-level metrics are among the well-known and widely used quantitative software metrics for measuring software quality in concrete terms. Software defect prediction can forecast which sub-module or sub-modules of the software to be developed may be more defect-prone. In this way, losses of labor and time can be prevented. In the datasets used for software defect prediction, the number of records in the defective class can be smaller than the number in the non-defective class, so these datasets generally have an imbalanced class distribution, which negatively affects the results of machine learning methods. Information gain is used in algorithms and methods such as decision trees, decision-tree-based rule classifiers, and attribute selection. In this study, software metrics that provide important information for software defect prediction were examined, and the CM1, JM1, KC1 and PC1 datasets from NASA's PROMISE software data repository were made more balanced with the SMOTE synthetic over-sampling algorithm and thereby improved in terms of information gain. As a result, software defect prediction datasets with higher classification performance in decision trees and software metric values with an increased information gain ratio were obtained.

Keywords:

Software defect prediction, Decision trees, Information gain ratio

Analyzing and improving information gain of metrics used in software defect prediction in decision trees

McCabe and Halstead method-level metrics are among the best-known and most widely used quantitative software metrics for measuring software quality in a concrete way. Software defect prediction estimates which sub-modules of the software under development are likely to be more defect-prone, so that losses of labor and time can be avoided. Datasets used for software defect prediction usually have an imbalanced class distribution, because records in the defective class can be fewer than records in the non-defective class, and this imbalance adversely affects the results of machine learning methods. Information gain is used in decision trees and in methods built on them, such as decision-tree-based rule classifiers and attribute selection. In this study, software metrics that provide important information for software defect prediction were investigated, and the CM1, JM1, KC1 and PC1 datasets from NASA's PROMISE software repository were made more balanced with the synthetic over-sampling algorithm SMOTE and thereby improved in terms of information gain. As a result, software defect prediction datasets with higher classification performance in decision trees and software metrics with an increased information gain ratio were obtained.
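For readers unfamiliar with the quantities named above, the following is a minimal sketch of how information gain and gain ratio can be computed for a single binary split on one numeric metric. It is an illustration under stated assumptions, not the procedure used in the study: the function names, the toy complexity values, the defect labels, and the threshold are all hypothetical, and real decision tree implementations search over many candidate thresholds.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_and_ratio(values, labels, threshold):
    """Information gain and gain ratio of the binary split value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    weights = [len(left) / total, len(right) / total]

    # Weighted entropy that remains after the split.
    remainder = weights[0] * entropy(left) + weights[1] * entropy(right)
    gain = entropy(labels) - remainder

    # Split information penalizes splits by how they partition the records.
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Hypothetical toy data: one complexity value per module and a defect label.
# The class distribution is deliberately imbalanced (6 clean, 2 defective).
complexity = [1, 2, 2, 3, 4, 9, 12, 15]
defective = [False, False, False, False, False, False, True, True]

gain, ratio = gain_and_ratio(complexity, defective, threshold=10)
print(f"information gain = {gain:.3f}, gain ratio = {ratio:.3f}")
```

In this toy case the threshold separates the two classes perfectly, so the gain equals the entropy of the full label set. On real, noisier data the same functions can be applied before and after over-sampling to compare how a metric's gain ratio changes.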

Keywords:

Software defect prediction, Decision trees, Information gain ratio
