Yılmaz Orhun GÜRLÜK, Mediha KORKMAZ, Gizem CÖMERT, Ömer Emre Can ALAGÖZ

PUANLAYICILAR ARASI UYUMUN FARKLI ÖLÇEKLEME TÜRLERİ, PUANLAYICI SAYISI VE PUANLANAN SAYISI AÇISINDAN İNCELENMESİ

Bu araştırmada klasik kuramlara göre puanlayıcılar arası uyum katsayılarını karşılaştırmak amaçlanmıştır. Farklı ölçekleme türlerine göre elde edilen katsayılar üzerinden hesaplanan değerler arasındaki farka odaklanılmış ve ölçekleme türüne karar vermenin önemi ortaya konmuştur. Puanlanan ve puanlayıcı sayısının değişmesinin değerleri etkileyip etkilemediğine bakılmış ve genellenebilirlik kuramının optimizasyon analizi kullanılarak puanlayıcılar arası uyum için kullanılacak en uygun örneklem büyüklüğü hesaplanmıştır. Araştırmada toplamda 35 çocuğa Bender Görsel Motor Gestalt II testinin yaş gruplarında ortak olan 9 kopyalama kartı uygulanmış ve alınan ölçümler toplamda 8 puanlayıcı tarafından birbirlerine kör olarak değerlendirilmiştir. Sonuçlara göre en yüksek uyum değeri sınıf içi korelasyon katsayısında hesaplanmış ve bu değeri sırasıyla Krippendorff alfa, Fleiss kappa ve Cohen kappa takip etmiştir. Hem puanlanan hem de puanlayıcı sayısı azaldıkça uyum değerlerinin düşme eğiliminde olduğu tespit edilmiştir. Öte yandan kartların zorluk düzeyinin anlamlı bir etkisi olmadığı saptanmıştır. Genellenebilirlik katsayılarının yüksek çıkması testin puanlayıcılar tarafından güvenilir şekilde puanlandığını göstermiştir. Optimizasyon analizi incelendiğinde bu test için en uygun örneklem büyüklüğünün 50 olduğu görülmüştür. Katılımcı sayısının 50’den fazla olması ise uyumu arttırmamıştır.

Anahtar Kelimeler:

Puanlayıcılar Arası Uyum, Ölçekleme Türü, Sınıf İçi Korelasyon, Kappa, Krippendorff Alfa, Genellenebilirlik Kuramı

EXAMINING INTERRATER AGREEMENT ACCORDING TO DIFFERENT SCALING TYPES, NUMBER OF RATERS AND NUMBER OF RATED SUBJECTS

This study aimed to compare interrater agreement coefficients based on classical theories. It focused on the differences between coefficients computed under different scaling types and demonstrated the importance of choosing the scaling type. It also examined whether changing the number of rated subjects and the number of raters affected the values, and the optimization analysis of generalizability theory was used to calculate the most appropriate sample size for interrater agreement. The nine copy cards of the Bender Visual-Motor Gestalt Test II that are common across age groups were administered to 35 children, and the resulting measurements were scored by eight raters, each blind to the others' ratings. The highest agreement value was obtained with the intraclass correlation coefficient, followed by Krippendorff's alpha, Fleiss' kappa, and Cohen's kappa, in that order. Agreement values tended to decrease as the number of rated subjects and the number of raters decreased. The difficulty level of the cards, on the other hand, had no significant effect. The high generalizability coefficients showed that the raters scored the test reliably. The optimization analysis indicated that the most suitable sample size for this test was 50; having more than 50 participants did not increase agreement.

Keywords:

Interrater Agreement, Scaling Types, Intraclass Correlation, Kappa, Krippendorff Alpha, Generalizability Theory
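
As an illustration of one of the coefficients compared above, the following is a minimal sketch of Fleiss' kappa for several raters assigning nominal categories. It is not the authors' analysis code: the function name `fleiss_kappa`, the input layout (one list of labels per rated subject, one label per rater), and the toy ratings are all assumptions made for the example.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for nominal ratings.

    ratings: list of subjects; each subject is a list holding one category
    label per rater. Every subject must have the same number of raters.
    (Hypothetical helper for illustration only.)
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    # How many raters chose each category, per subject.
    counts = [Counter(row) for row in ratings]
    categories = set().union(*counts)

    # Observed agreement: share of agreeing rater pairs, averaged over subjects.
    per_subject = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    total = n_subjects * n_raters
    p_expected = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (made up): 4 subjects, 3 raters, categories 0/1.
toy = [[0, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]]
print(round(fleiss_kappa(toy), 3))
```

Dropping rows (rated subjects) or columns (raters) from such a matrix and recomputing the coefficient is one simple way to explore how the value responds to the two sample sizes discussed in the abstract.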
