Investigation of the Delta and Bootstrap Methods Used to Determine Equating Errors in Item Response Theory According to Various Variables*

A review of the literature shows that there are many studies on test equating and its benefits. However, alongside its benefits, test equating also has limitations, and the best known of these is arguably equating error. This study aimed to examine Item Response Theory (IRT) observed-score and true-score equating errors, obtained with the bootstrap and delta methods from TIMSS 2015 4th-grade mathematics data, across different sample sizes and scale transformation methods. The study is descriptive in that it examines a method more comprehensively and in terms of different variables. To keep the examinees' ability levels close to one another, samples of 500, 1000 and 3000 examinees were drawn at random from the data of countries located above and below the midpoint (500) of the TIMSS achievement scale (Australia, Canada, Italy, Spain, Croatia, the Slovak Republic, New Zealand, Turkey and Georgia). To decide which of the 14 booklets used in TIMSS 2015 to analyse, the IRT assumptions were examined, and the booklet pair for which all model fit indices were at acceptable levels was selected. The analyses showed that, in general, the lowest error values for both methods were obtained with the Stocking-Lord method, and that at every sample size the errors obtained with the delta method were higher than those obtained with the bootstrap method.

Keywords:

Standard error of equating, Delta method, Bootstrap
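The two error-estimation approaches compared in the study can be illustrated on a much simpler equating function. The sketch below is a stand-in and not the study's IRT procedure. It makes the following assumptions:

- It uses linear equating, l_Y(x) = μ_Y + (σ_Y/σ_X)(x − μ_X), in a random-groups design.
- Scores are simulated and normally distributed.
- The function names and score distributions are hypothetical.

The delta-method standard error comes from differentiating l_Y(x) with respect to the two means and two standard deviations. Applying the normal-theory sampling variances of those statistics gives SE² ≈ σ_Y²(1/N_X + 1/N_Y)(1 + z²/2), with z = (x − μ_X)/σ_X. The bootstrap standard error instead resamples examinees and re-equates.

```python
import numpy as np


def linear_equate(x, scores_x, scores_y):
    """Linear equating: l_Y(x) = mu_Y + (sd_Y / sd_X) * (x - mu_X)."""
    mu_x, sd_x = scores_x.mean(), scores_x.std(ddof=1)
    mu_y, sd_y = scores_y.mean(), scores_y.std(ddof=1)
    return mu_y + (sd_y / sd_x) * (np.asarray(x, dtype=float) - mu_x)


def delta_se(x, scores_x, scores_y):
    """Delta-method SE of random-groups linear equating, assuming normal scores:
    Var ~= sd_Y^2 * (1/N_X + 1/N_Y) * (1 + z^2 / 2), z = (x - mu_X) / sd_X."""
    n_x, n_y = len(scores_x), len(scores_y)
    z = (np.asarray(x, dtype=float) - scores_x.mean()) / scores_x.std(ddof=1)
    return np.sqrt(scores_y.var(ddof=1) * (1 / n_x + 1 / n_y) * (1 + z**2 / 2))


def bootstrap_se(x, scores_x, scores_y, n_boot=1000, seed=0):
    """Nonparametric bootstrap SE: resample examinees with replacement within
    each group, re-equate, and take the SD of the equated scores."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    equated = np.empty((n_boot, x.size))
    for r in range(n_boot):
        bx = rng.choice(scores_x, size=len(scores_x), replace=True)
        by = rng.choice(scores_y, size=len(scores_y), replace=True)
        equated[r] = linear_equate(x, bx, by)
    return equated.std(axis=0, ddof=1)


if __name__ == "__main__":
    rng = np.random.default_rng(2015)
    points = np.array([10.0, 20.0, 30.0])  # form-X raw scores to equate
    for n in (500, 1000, 3000):  # sample sizes used in the study
        form_x = rng.normal(20, 6, size=n)  # simulated scores (hypothetical)
        form_y = rng.normal(22, 7, size=n)
        print(n, delta_se(points, form_x, form_y).round(3),
              bootstrap_se(points, form_x, form_y).round(3))
```

With normally distributed simulated scores, the two estimates should agree closely. Both should also shrink roughly in proportion to 1/√N across the 500, 1000 and 3000 conditions. The sketch is not expected to reproduce the study's IRT-based finding that delta-method errors exceeded bootstrap errors.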

Investigation of Delta and Bootstrap Methods for Calculating Test Equating Error in IRT in Terms of Several Variables

A review of the literature shows that there are many studies on test equating and its benefits. However, test equating also has limitations, and the best known of these is arguably equating error. This study aimed to examine Item Response Theory (IRT) observed-score and true-score equating errors, estimated with the bootstrap and delta methods from TIMSS 2015 4th-grade mathematics data, across different sample sizes and scale transformation methods. The study is descriptive in that it investigates one method in more detail and in terms of different variables. Randomly drawn samples of 500, 1000 and 3000 examinees were taken from countries above and below the midpoint (500) of the TIMSS achievement scale (Australia, Canada, Italy, Spain, Croatia, the Slovak Republic, New Zealand, Turkey and Georgia). To decide which of the 14 TIMSS 2015 booklets to use, the IRT assumptions were checked, and the booklet pair with acceptable values on all model fit indices was chosen. The analyses showed that, for both methods, the scale transformation method yielding the smallest equating errors was Stocking-Lord. At all sample sizes, the errors estimated by the bootstrap method were smaller than those estimated by the delta method.

Keywords:

Standard error of equating, Delta method, Bootstrap
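Stocking-Lord is the scale transformation that gave the smallest errors in this study. It chooses linking constants A and B so that the common items' test characteristic curves (TCCs) agree as closely as possible once the form-X parameters are placed on the form-Y scale (a/A, Ab + B, c unchanged).

The Python sketch below is illustrative only and rests on several assumptions not taken from the study:

- items follow the 3PL model with D = 1.7;
- the common-item parameters are hypothetical;
- an unweighted grid of θ points replaces an ability distribution;
- SciPy's general-purpose `minimize` (Nelder-Mead) is used rather than a dedicated equating package.

Mean-sigma constants serve only as starting values.

```python
import numpy as np
from scipy.optimize import minimize

D = 1.7  # logistic scaling constant (assumed)


def p3pl(theta, a, b, c):
    """3PL response probabilities: rows are theta points, columns are items."""
    theta = np.asarray(theta, dtype=float)[:, None]
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))


def mean_sigma(b_x, b_y):
    """Mean-sigma constants from common-item difficulties (starting values only)."""
    A = b_y.std(ddof=1) / b_x.std(ddof=1)
    return np.array([A, b_y.mean() - A * b_x.mean()])


def stocking_lord(items_x, items_y, theta=None):
    """Return (A, B) minimising the squared gap between common-item TCCs after
    moving form-X parameters to the Y scale: a/A, A*b + B, c unchanged."""
    if theta is None:
        theta = np.linspace(-4, 4, 41)  # unweighted theta grid (simplification)
    a_x, b_x, c_x = items_x
    a_y, b_y, c_y = items_y
    tcc_y = p3pl(theta, a_y, b_y, c_y).sum(axis=1)

    def criterion(params):
        A, B = params
        tcc_x = p3pl(theta, a_x / A, A * b_x + B, c_x).sum(axis=1)
        return np.sum((tcc_y - tcc_x) ** 2)

    result = minimize(criterion, x0=mean_sigma(b_x, b_y), method="Nelder-Mead")
    return result.x


if __name__ == "__main__":
    rng = np.random.default_rng(4)
    k = 15  # number of hypothetical common items
    a_y = rng.uniform(0.6, 1.8, k)
    b_y = rng.normal(0.0, 1.0, k)
    c_y = rng.uniform(0.1, 0.25, k)
    A_true, B_true = 1.2, 0.3  # hypothetical scale relation theta_Y = A*theta_X + B
    # The same items expressed on the X scale, plus a little estimation noise
    a_x = a_y * A_true * np.exp(rng.normal(0, 0.05, k))
    b_x = (b_y - B_true) / A_true + rng.normal(0, 0.05, k)
    print(stocking_lord((a_x, b_x, c_y), (a_y, b_y, c_y)))
```

Run as a script, it should return values close to the hypothetical A = 1.2 and B = 0.3 despite the added noise. In a bootstrap analysis, a fit like this would be repeated on every resample; that wrapper is omitted here.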
