Çoklu Regresyon Modellerinde Genetik Algoritma ve Bayes Bilgi Kriteri Kullanarak Sapan Değerlerin Belirlenmesi

Statistical models, regression models in particular, are among the most widely used tools for understanding and revealing the important features of data sets. However, many real-world data sets contain a certain number of anomalous values, generally called outliers. Correctly detecting outliers plays an important role in statistical analysis, especially in regression models. Nevertheless, many classical statistical models are applied to data sets that contain outliers, and the results end up being misleading. Outliers also make it harder to identify the appropriate multiple regression model.

Outlier Detection in Multiple Regression Models Using Genetic Algorithms and Bayesian Information Criteria

Statistical models, particularly regression models, are among the most useful tools for extracting and understanding the essential features of data sets. However, many real-world data sets contain a certain number of abnormal values, generally termed outliers. Accurate identification of outliers plays a significant role in statistical analysis, especially in regression models. Nevertheless, many classical statistical models are blindly applied to data sets containing outliers, and the results can be misleading at best. The presence of outliers can also degrade the fit of multiple regression models. The aim of this study is to define an outlier detection method that uses Genetic Algorithms (GA) with the Bayesian Information Criterion (BIC) and to illustrate the algorithm with real and simulated data. The algorithm uses a fitness function based on BIC. A smaller value of the criterion indicates a model that fits the data better; the presence of one or more outliers degrades the regression model and results in larger BIC values.
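The idea of a BIC-based fitness function inside a GA can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: each chromosome is a binary mask over the observations (1 = flagged as outlier), each flagged point costs one extra parameter (a mean-shift dummy), and the fitness is BIC = n ln(RSS/n) + k ln(n) with k = p + m, where p is the number of regression coefficients and m the number of flagged points. The names `ols_rss`, `bic_fitness`, `detect_outliers`, and the cap `max_frac` are hypothetical, and the GA settings (tournament selection, one-point crossover, bit-flip mutation, elitism) are generic choices rather than the study's.

```python
import math
import random

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (assumes X has full column rank)."""
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(X, y))

def bic_fitness(mask, X, y, max_frac=0.25):
    """Assumed fitness: BIC of a mean-shift model with one dummy per flagged point."""
    n, p, m = len(y), len(X[0]), sum(mask)
    # Hypothetical cap: reject masks that flag too many points or leave too few to fit.
    if m > max_frac * n or n - m <= p:
        return float("inf")
    keep = [i for i in range(n) if not mask[i]]
    rss = ols_rss([X[i] for i in keep], [y[i] for i in keep])
    return n * math.log(max(rss, 1e-12) / n) + (p + m) * math.log(n)

def detect_outliers(X, y, pop_size=40, generations=60, p_mut=0.02, seed=0):
    """Generic GA over outlier masks; smaller BIC is better."""
    rng = random.Random(seed)
    n = len(y)

    def score(mask):
        return bic_fitness(mask, X, y)

    # Sparse random start, plus the "no outliers" mask as a baseline individual.
    pop = [[int(rng.random() < 0.05) for _ in range(n)] for _ in range(pop_size)]
    pop[0] = [0] * n
    for _ in range(generations):
        pop.sort(key=score)
        nxt = pop[:2]  # elitism: carry over the two best masks unchanged
        while len(nxt) < pop_size:
            a = min(rng.sample(pop, 3), key=score)  # tournament selection
            b = min(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n)               # one-point crossover
            child = a[:cut] + b[cut:]
            nxt.append([g ^ (rng.random() < p_mut) for g in child])  # bit-flip mutation
        pop = nxt
    best = min(pop, key=score)
    return [i for i, g in enumerate(best) if g]

# Toy data: y = 1 + 2x with small alternating noise and one planted outlier at index 5.
X = [[1.0, float(i)] for i in range(12)]
y = [1 + 2 * i + 0.1 * (-1) ** i for i in range(12)]
y[5] += 30.0
print(detect_outliers(X, y))
```

With a plain ln(n) penalty, a criterion of this kind can also flag clean points whose residuals happen to be large, so the printed set may contain more than the planted index; the cap `max_frac` is a crude guard against masks that discard most of the data.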
