The use of cross-company fault data for the software fault prediction problem

We investigated how to use cross-company (CC) data in software fault prediction, and in particular how to predict the fault labels of software modules when sufficient local fault data are unavailable. This paper presents case studies of NASA projects available from the PROMISE repository. The case studies show that CC data help build high-performance fault predictors in the absence of fault labels. We suggest that companies with no historical fault data of their own use CC data when building their fault prediction models.
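As a rough sketch of the cross-company idea (not the paper's experimental setup), one can train a simple fault predictor on another project's labeled modules and apply it to one's own unlabeled modules. The snippet below uses a hand-rolled Gaussian naive Bayes classifier on synthetic data; the two features loosely echo the McCabe/Halstead static code attributes found in the PROMISE NASA datasets, but all numbers and names here are illustrative assumptions.

```python
# Illustrative sketch: train on "cross-company" (CC) modules, predict on
# "within-company" (WC) modules. All data is synthetic.
import math
import random

random.seed(0)

def make_modules(n, fault_rate):
    """Synthetic modules: faulty ones tend to be larger and more complex."""
    data = []
    for _ in range(n):
        faulty = random.random() < fault_rate
        loc = random.gauss(300 if faulty else 100, 50)  # lines of code
        vg = random.gauss(15 if faulty else 5, 3)       # cyclomatic complexity
        data.append(((loc, vg), int(faulty)))
    return data

def train_gnb(data):
    """Gaussian naive Bayes: per-class priors, feature means, and variances."""
    model = {}
    for label in (0, 1):
        rows = [x for x, y in data if y == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / len(rows)
                 for col, m in zip(zip(*rows), means)]
        model[label] = (len(rows) / len(data), means, varis)
    return model

def predict(model, x):
    """Pick the class with the higher log-posterior under the Gaussian model."""
    def log_post(label):
        prior, means, varis = model[label]
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, varis):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return lp
    return max((0, 1), key=log_post)

# Train on the other company's labeled project; evaluate on our own modules.
cc_train = make_modules(500, fault_rate=0.2)
wc_test = make_modules(200, fault_rate=0.2)
model = train_gnb(cc_train)
acc = sum(predict(model, x) == y for x, y in wc_test) / len(wc_test)
print(f"accuracy on within-company modules: {acc:.2f}")
```

Since the two synthetic projects share the same metric-to-fault relationship, the CC-trained model transfers well here; in practice, differences between companies' measurement distributions are exactly what makes cross-company prediction harder than this sketch suggests.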

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK