Model selection in regression resistant to high leverage points and vertical outliers

For prediction it is necessary to consider only relevant predictor variables, because irrelevant predictors in a regression model tend to produce misleading inference. Many model selection methods are available in the literature; among these, some are resistant to vertical outliers, but the simultaneous presence of vertical outliers and leverage points is not well studied. In this article, we modify the Sp statistic using the generalized M-estimator to obtain a robust model selection criterion in the presence of vertical outliers and high leverage points. The proposed criterion selects only the relevant predictor variables with probability tending to one as the sample size grows. We establish the equivalence of this criterion with the existing Cp and Sp criteria. The superiority of the proposed criterion is demonstrated using simulated and real data.
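The core ingredient of the proposal is a generalized M-estimator, which bounds the influence of vertical outliers (via a psi function on residuals) and of high-leverage points (via weights on the design rows). The paper's specific Sp modification is not reproduced here; the sketch below only illustrates the general GM idea, assuming a Mallows-type scheme with Huber's psi and hat-matrix leverage weights, and the function names `gm_estimate` and `huber_psi` are illustrative choices, not from the paper.

```python
import numpy as np

def huber_psi(u, c=1.345):
    """Huber's psi function: linear near zero, clipped at +/- c."""
    return np.clip(u, -c, c)

def gm_estimate(X, y, c=1.345, max_iter=100, tol=1e-8):
    """Illustrative Mallows-type generalized M-estimator (sketch).

    Bounds the influence of vertical outliers through Huber's psi on
    scaled residuals, and of high-leverage points through weights based
    on the hat-matrix diagonal. Not the paper's Sp modification.
    """
    n, p = X.shape
    # Hat-matrix leverages h_ii; rows with leverage above 2p/n are shrunk.
    lev = np.sum(X * (X @ np.linalg.pinv(X.T @ X)), axis=1)
    w_x = np.minimum(1.0, (2.0 * p / n) / np.maximum(lev, 1e-12))
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting value
    for _ in range(max_iter):
        r = y - X @ beta
        # Robust residual scale via the normalized MAD.
        scale = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-8)
        u = r / scale
        # IRLS weights: psi(u)/u for residuals, times the leverage weight.
        w_r = np.ones_like(u)
        mask = np.abs(u) > 1e-12
        w_r[mask] = huber_psi(u[mask], c) / u[mask]
        w = w_x * w_r
        A = X.T @ (X * w[:, None])   # X' W X
        b = X.T @ (w * y)            # X' W y
        beta_new = np.linalg.solve(A, b)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On data contaminated with a few bad high-leverage rows, such an estimator stays close to the coefficients of the clean bulk, whereas ordinary least squares is pulled toward the contamination; a model selection statistic built on residuals from such a fit inherits this resistance.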

___

  • [1] M. Alguraibawi, H. Midi and A.H.M. Imon, A new robust diagnostic plot for classifying good and bad high leverage points in a multiple linear regression model, Math. Probl. Eng., DOI: 10.1155/2015/279472, 2015.
  • [2] C.D.S. André, S.N. Elian, S.C. Narula and R.A. Tavares, Coefficients of determination for variable selection in the MSAE regression, Comm. Statist. Theory Methods 29 (3), 623-642, 2000.
  • [3] B. Asikgil and A. Erar, Research into multiple outliers in linear regression analysis, Hacet. J. Math. Stat. 38 (2), 185-198, 2009.
  • [4] A.C. Atkinson, Fast very robust methods for the detection of multiple outliers, J. Amer. Statist. Assoc. 89 (428), 1329-1339, 1994.
  • [5] A. Bab-Hadiashar and D. Suter, Data Segmentation and Model Selection for Computer Vision: A Statistical Approach, Springer, 2000.
  • [6] D.A. Belsley, E. Kuh and R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, 1980.
  • [7] H. Bozdogan and D.M.A. Haughton, Informational complexity criteria for regression models, Comput. Statist. Data Anal. 28 (1), 51-76, 1998.
  • [8] S. Chatterjee and A.S. Hadi, Influential observations, high leverage points and outliers in linear regression, Statist. Sci. 1 (3), 379-416, 1986.
  • [9] C.W. Coakley and T.P. Hettmansperger, A bounded influence, high breakdown, efficient regression estimator, J. Amer. Statist. Assoc. 88 (423), 872-880, 1993.
  • [10] C. Croux and C. Dehon, Estimators of the multiple correlation coefficient: Local robustness and confidence intervals, Statist. Papers 44 (3), 315-334, 2003.
  • [11] C. Dehon, M. Gassner and V. Verardi, Beware of ‘Good’ outliers and overoptimistic conclusions, Oxf. Bull. Econ. Stat. 71 (3), 437-452, 2009.
  • [12] A.S. Hadi and J.S. Simonoff, Procedures for the identification of multiple outliers in linear models, J. Amer. Statist. Assoc. 88 (424), 1264-1272, 1993.
  • [13] F. R. Hampel, E.M. Ronchetti, P.J. Rousseeuw and W.A. Stahel, Robust Statistics: The Approach based on Influence Functions, Wiley, 1986.
  • [14] R.W. Hill, When there are outliers in the carriers: The univariate case, Comm. Statist. Theory Methods 11 (8), 849-868, 1982.
  • [15] P.W. Holland and R.E. Welsch, Robust regression using iteratively reweighted least-squares, Comm. Statist. Theory Methods 6 (9), 813-827, 1977.
  • [16] B. Hu and J. Shao, Generalized linear model selection using R2, J. Statist. Plann. Inference 138 (12), 3705-3712, 2008.
  • [17] P.J. Huber, Robust estimation of a location parameter, Ann. Math. Stat. 35 (1), 73-101, 1964.
  • [18] D.N. Kashid and S.R. Kulkarni, A more general criterion for subset selection in multiple linear regression, Comm. Statist. Theory Methods 31 (5), 795-811, 2002.
  • [19] C. Kim and S. Hwang, Influence subsets on the variable selection, Comm. Statist. Theory Methods 29 (2), 335-347, 2000.
  • [20] W.S. Krasker and R.E. Welsch, Efficient bounded-influence regression estimation, J. Amer. Statist. Assoc. 77 (379), 595-604, 1982.
  • [21] J.A.F. Machado, Robust Model Selection and M-estimation, Econometric Theory 9 (3), 478-493, 1993.
  • [22] C. Mallows, Some comments on Cp, Technometrics 15 (4), 661-675, 1973.
  • [23] R.A. Maronna, R.D. Martin and V.J. Yohai, Robust Statistics: Theory and Methods, Wiley, 2006.
  • [24] R.A. Maronna and V.J. Yohai, Asymptotic behavior of general M-estimates for regression and scale with random carriers, Probab. Theory Related Fields 58 (1), 7-20, 1981.
  • [25] C.R. Rao, Y. Wu, S. Konishi and R. Mukerjee, On model selection, Lecture Notes in Monograph Series 38, 1-64, 2001.
  • [26] O. Renaud and M.P. Victoria-Feser, A robust coefficient of determination for regression, J. Statist. Plann. Inference 140 (7), 1852-1862, 2010.
  • [27] E. Ronchetti, Robust model selection in regression, Statist. Probab. Lett. 3 (1), 21-23, 1985.
  • [28] E. Ronchetti and R.G. Staudte, A robust version of Mallows’s Cp, J. Amer. Statist. Assoc. 89 (426), 550-559, 1994.
  • [29] P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, Wiley, 2003.
  • [30] P.J. Rousseeuw and B.C. Van Zomeren, Unmasking multivariate outliers and leverage points, J. Amer. Statist. Assoc. 85 (411), 633-639, 1990.
  • [31] S. Sommer and R.G. Staudte, Robust variable selection in regression in the presence of outliers and leverage points, Aust. N. Z. J. Stat. 37 (3), 323-336, 1995.
  • [32] K. Tharmaratnam and G. Claeskens, A comparison of robust versions of the AIC based on M, S and MM-estimators, Statistics 47 (1), 216-235, 2013.
  • [33] R. Wilcox, Introduction to Robust Estimation and Hypothesis Testing, Elsevier, 2012.
  • [34] R. Wilcox, Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction, CRC Press, 2012.