PREDICTIVE PERFORMANCES OF IMPLICITLY AND EXPLICITLY ROBUST CLASSIFIERS ON HIGH DIMENSION LOW SAMPLE SIZE DATA

The goal of this paper is to demonstrate via extensive simulation that implicit robustness can substantially outperform explicit robustness in the pattern recognition of contaminated high dimension low sample size data. Our work specifically demonstrates, via extensive computational simulations and applications to real life data, that random subspace ensemble learning machines, although not explicitly structurally designed as robustness-inducing supervised learning paradigms, outperform the structurally robustness-seeking classifiers on high dimension low sample size datasets. Random forest (RF), which is arguably the most commonly used random subspace ensemble learning method, is compared to various robust extensions/adaptations of the discriminant analysis classifier, and our work reveals that RF, although not inherently designed to be robust to outliers, substantially outperforms the existing techniques specifically designed to achieve robustness. Specifically, by exploring different scenarios of the sample size n and the input space dimensionality p, along with the corresponding capacity κ = n/p with κ < 1, we demonstrate through extensive simulations that, regardless of the contamination rate ϵ, RF predictively outperforms the explicitly robustness-inducing classification techniques when the intrinsic dimensionality of the data is large.
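The simulation regime described above (sample size n, dimensionality p, capacity κ = n/p < 1, contamination rate ϵ) can be illustrated with a minimal sketch. The code below assumes a simple ϵ-contamination model in which a fraction ϵ of the observations is shifted far from the class means; the function name, the two-class Gaussian design, and the outlier shift are illustrative assumptions, not the paper's exact simulation protocol.

```python
import numpy as np

def simulate_contaminated_hdlss(n=50, p=200, eps=0.1, shift=5.0, seed=0):
    """Simulate a two-class high dimension low sample size dataset
    (capacity kappa = n/p < 1) in which a fraction eps of the
    observations is contaminated by large additive shifts.

    Illustrative assumption: classes are Gaussian with identity
    covariance and differ only in their first 10 coordinates.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)            # binary class labels
    mu = np.zeros((2, p))
    mu[1, :10] = 1.0                          # class-1 mean shift in 10 coords
    X = rng.normal(size=(n, p)) + mu[y]       # clean Gaussian sample
    m = int(round(eps * n))                   # number of contaminated points
    idx = rng.choice(n, size=m, replace=False)
    X[idx] += shift * rng.choice([-1, 1], size=(m, p))  # gross outliers
    return X, y, idx

X, y, idx = simulate_contaminated_hdlss()
kappa = X.shape[0] / X.shape[1]               # capacity kappa = n/p
```

With n = 50 and p = 200 this yields κ = 0.25 < 1, the regime studied in the paper; both RF and the robust discriminant analysis variants would then be trained on (X, y) and scored on an uncontaminated test sample.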
Keywords:



[Figure 10. Average test error on the strongly contaminated simulated data with g = 3, showing for each of the 7 methods the effect of the input space dimension p on the average test error.]
Necla Gündüz: Fen Fakültesi, İstatistik Bölümü, Gazi Üniversitesi, Ankara, Turkey
E-mail address: ngunduz@gazi.edu.tr

Ernest Fokoué: School of Mathematical Sciences, Rochester Institute of Technology, Rochester, New York 14623, USA
E-mail address: epfeqa@rit.edu