İbrahim Alper KÖSE, Begüm ÖZTEMUR

An Investigation of the Effect of Missing Data Handling Methods on t-Test and ANOVA Parameters

The aim of this study is to examine the effect of missing data handling methods on t-test and ANOVA parameters. The study was carried out on artificial data sets of 50, 100, 200, and 400 units. The data sets were generated to follow a normal distribution with low and high correlation. Values were removed at random at rates of 5%, 10%, and 20%, in line with the Missing Completely at Random (MCAR) mechanism. Deletion, mean substitution, regression, and expectation-maximization methods were applied to the generated data sets. The values produced by the methods varied considerably across data sets of different correlation and size. In data sets with few units, the regression and Expectation-Maximization (EM) methods gave the closest results, whereas in data sets with many units, the regression and mean substitution methods gave results more consistent with the analysis values obtained from the complete data sets.

Keywords:

Missing Data Analysis, Imputation Methods, EM, Mean Substitution, Regression

Examining the Effect of Missing Data Handling Methods on the Parameters of t-Test and ANOVA

The purpose of this study was to examine the effect of missing data handling methods on the parameters of the t-test and ANOVA. The study was conducted with simulated data sets of 50, 100, 200, and 400 units, generated to follow normal distributions with low and high correlation. Values were removed at random at rates of 5%, 10%, and 20% so that the missingness was Missing Completely at Random (MCAR). Deletion, mean substitution, regression, and expectation-maximization (EM) methods were then applied to the simulated data sets. The results produced by the methods varied considerably across sample sizes and correlation levels. In data sets with small sample sizes, regression and EM gave the closest results, whereas in data sets with larger sample sizes, regression and mean substitution gave results more consistent with those obtained from the complete data sets.

Keywords:

Missing Value Analysis, Imputation Method, EM, Mean Substitution Method, Regression
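
The design summarized in the abstract can be illustrated with a small sketch. It is not the authors' procedure: the function names, the seed, the choice of n = 100, correlation 0.8, and a 20% missing rate, and the use of a one-sample t statistic against a mean of 0 are all assumptions made for illustration. EM is left out to keep the example short. The sketch generates correlated normal pairs, removes values of one variable completely at random, applies deletion, mean substitution, and regression imputation, and compares the resulting t statistics with the one from the complete data.

```python
import math
import random
import statistics


def simulate_pairs(n, rho, rng):
    """Draw n (x, y) pairs from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        pairs.append((x, rho * x + math.sqrt(1.0 - rho ** 2) * e))
    return pairs


def make_mcar(pairs, rate, rng):
    """Set a random fraction of y values to None, independently of x and y (MCAR)."""
    n_missing = round(rate * len(pairs))
    missing = set(rng.sample(range(len(pairs)), n_missing))
    return [(x, None if i in missing else y) for i, (x, y) in enumerate(pairs)]


def listwise_deletion(data):
    """Keep only the observed y values."""
    return [y for _, y in data if y is not None]


def mean_substitution(data):
    """Replace each missing y with the mean of the observed y values."""
    m = statistics.fmean(listwise_deletion(data))
    return [m if y is None else y for _, y in data]


def regression_imputation(data):
    """Replace each missing y with its least-squares prediction from x."""
    complete = [(x, y) for x, y in data if y is not None]
    mx = statistics.fmean(x for x, _ in complete)
    my = statistics.fmean(y for _, y in complete)
    sxx = sum((x - mx) ** 2 for x, _ in complete)
    sxy = sum((x - mx) * (y - my) for x, y in complete)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x if y is None else y for x, y in data]


def one_sample_t(values, mu0=0.0):
    """t statistic for H0: mean == mu0 (an illustrative choice of test)."""
    n = len(values)
    return (statistics.fmean(values) - mu0) / (statistics.stdev(values) / math.sqrt(n))


rng = random.Random(42)                # fixed seed, arbitrary
full = simulate_pairs(100, 0.8, rng)   # one high-correlation condition
holed = make_mcar(full, 0.20, rng)     # 20% of y removed at random

results = {
    "complete": one_sample_t([y for _, y in full]),
    "deletion": one_sample_t(listwise_deletion(holed)),
    "mean": one_sample_t(mean_substitution(holed)),
    "regression": one_sample_t(regression_imputation(holed)),
}
for name, t in results.items():
    print(f"{name:>10}: t = {t:.3f}")
```

One property is visible directly from the formulas: mean substitution leaves the sample mean unchanged while shrinking the standard deviation and restoring the full n, so its t statistic is never smaller in absolute value than the one from listwise deletion. How each method compares with the complete-data value depends on the sample size, correlation, and missing rate, which is the comparison the study carries out.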

