Robust correlation scaled principal component regression
In multiple regression, several techniques are available for handling a large number of predictors that exhibit multicollinearity. Some of these approaches rely on correlation, while others rely on principal components. To cope with influential observations (outliers, leverage points, or both) in the data matrix, this paper proposes two techniques: Robust Correlation Based Regression (RCBR) and Robust Correlation Scaled Principal Component Regression (RCSPCR). The proposed methods are compared with existing methods, namely traditional Principal Component Regression (PCR), Correlation Scaled Principal Component Regression (CSPCR), and Correlation Based Regression (CBR). In addition, Macro (Missingness and cellwise and row-wise outliers) RCSPCR is proposed to handle multicollinearity, high dimensionality, outliers, and missing observations simultaneously. The proposed techniques are assessed on several simulated scenarios with appropriate levels of contamination. The results indicate that the proposed techniques appear more reliable for analyzing data with missing values and outliers. Real-life data applications are also used to illustrate the performance of the proposed methods.
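The abstract does not spell out the estimators, so the following is only a minimal sketch of one possible reading of correlation scaled principal component regression with robust ingredients. It assumes that "correlation scaling" means weighting each standardized predictor by its correlation with the response before extracting principal components. The Spearman rank correlation, the median/MAD standardization, and the names `rank_correlation`, `fit_sketch`, and `predict_sketch` are illustrative assumptions, not the paper's definitions. The final regression step uses ordinary least squares for brevity, which is not itself robust.

```python
import numpy as np


def rank_correlation(a, b):
    """Spearman correlation via ranks (assumes no ties); a simple robust stand-in."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ranks_a, ranks_b)[0, 1]


def fit_sketch(X, y, n_components):
    """Hypothetical correlation scaled PCR with robust scaling (illustration only)."""
    # Robust standardization: median and scaled MAD instead of mean and std.
    center = np.median(X, axis=0)
    scale = 1.4826 * np.median(np.abs(X - center), axis=0)
    Z = (X - center) / scale
    # Assumed "correlation scaling": weight each predictor by its
    # rank correlation with the response.
    weights = np.array([rank_correlation(Z[:, j], y) for j in range(Z.shape[1])])
    Zw = Z * weights
    # Principal components of the weighted predictors via SVD.
    offset = Zw.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zw - offset, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = (Zw - offset) @ loadings
    # Plain least squares on the retained scores (a simplification, not robust).
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"center": center, "scale": scale, "weights": weights,
            "offset": offset, "loadings": loadings, "coef": coef}


def predict_sketch(model, X):
    """Apply the stored scaling, weighting, and projection, then the linear fit."""
    Z = (X - model["center"]) / model["scale"]
    scores = (Z * model["weights"] - model["offset"]) @ model["loadings"]
    return model["coef"][0] + scores @ model["coef"][1:]


# Synthetic, strongly collinear predictors driven by one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.outer(latent, [1.0, 0.9, 0.8, 0.7, 0.6]) + 0.05 * rng.normal(size=(200, 5))
y = 2.0 * latent + 0.1 * rng.normal(size=200)

model = fit_sketch(X, y, n_components=1)
pred = predict_sketch(model, X)
```

Because the five predictors share a single latent factor, one component is enough here. A robust regression on the scores, or a PCA method that tolerates cellwise outliers and missing values, could replace the corresponding steps in this sketch.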