Determination of Group Outliers with a Robust Clustering Method

Observations that do not conform to the model suggested by the majority of the observations are defined as outliers. Outliers generally cause the loss of information supported by the majority of the observations. Many approaches exist for detecting outliers. The choice of an appropriate test depends on the distribution the data come from, the parameters of that distribution, and the type and number of outliers expected. In this study, a clustering-based outlier detection method is proposed for low- and high-dimensional data. The effectiveness of the method is demonstrated on real and simulated data sets.

Key Words: Clustering, Outlier Detection, High-Dimensional Data
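The abstract does not spell out the steps of the proposed method, so the following is only a generic illustration of what clustering-based detection of group outliers can look like, not the authors' algorithm. It assumes plain k-means with a farthest-point initialisation, treats any cluster holding less than a chosen fraction of the data as a group of outliers, and flags remaining points whose distance to their cluster centre exceeds a median/MAD cutoff. All function names, the parameters `min_frac` and `cutoff`, and the toy data are hypothetical.

```python
import math
import statistics


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen centres."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


def flag_outliers(points, k, min_frac=0.2, cutoff=3.0):
    """Return one boolean per point: True if flagged as an outlier (illustrative rule only)."""
    labels, centers = kmeans(points, k)
    n = len(points)
    flags = [False] * n
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if not idx:
            continue
        if len(idx) / n < min_frac:
            # Assumption: a very small cluster is treated as a group of outliers.
            for i in idx:
                flags[i] = True
            continue
        # Assumption: within a large cluster, use a median/MAD rule on distances.
        dists = [math.dist(points[i], centers[j]) for i in idx]
        med = statistics.median(dists)
        mad = statistics.median([abs(d - med) for d in dists])
        limit = med + cutoff * 1.4826 * mad
        for i, d in zip(idx, dists):
            if mad > 0 and d > limit:
                flags[i] = True
    return flags


# Toy data: two large groups and one small, distant group (made up for illustration).
base = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2), (0.4, 0.6)]
points = base + [(x + 10, y + 10) for x, y in base] + [(30, 30), (30.5, 30), (30, 30.5)]
flags = flag_outliers(points, k=3)
flagged = [i for i, f in enumerate(flags) if f]
```

In this toy setting the three distant points form their own small cluster and are the ones expected to be flagged. The number of clusters and both thresholds would need to be chosen for the data at hand, and Euclidean distance may be a poor choice in high-dimensional settings.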
