The Effect of Multidimensionality on Concurrent and Separate Calibration in Test Equating

This study investigates how multidimensionality affects the equated scores obtained under concurrent and separate calibration. The study was conducted using simulated data. A total of 40 conditions were examined: [5 (degree of multidimensionality: 0.90, 0.75, 0.50, 0.25, and 0.00) x 2 (calibration method: concurrent and separate) x 2 (scale transformation method: Stocking-Lord and Haebara) x 2 (test equating method: IRT true score equating and IRT observed score equating)]. Multidimensionality was created by assuming that the tests measure two distinct abilities, θ1 and θ2. The lower the correlation between the two abilities, the greater the degree of multidimensionality: the condition with a correlation of 0.90 represents the weakest multidimensionality, and the condition with a correlation of 0.00 represents the most severe. The tests were equated under the common-item non-equivalent groups design. The equated scores were evaluated using bias, standard deviation, and RMSE as criteria. The findings showed that, across all conditions, equating results obtained from concurrent calibration were generally more biased and had higher RMSE values (larger equating error) than those obtained from separate calibration. In terms of standard deviation, when the degree of multidimensionality was low, concurrent calibration and separate calibration with the Stocking-Lord or Haebara scale transformation method performed similarly; when multidimensionality was severe, however, the equating results with the least random error were obtained from concurrent calibration.
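
The design and evaluation criteria described above can be illustrated with a minimal Python sketch. It is not the authors' simulation code. The function names (`condition_grid`, `draw_abilities`, `equating_error_summary`) are hypothetical, and the standard-normal marginals for the two abilities and the use of the population standard deviation (`ddof=0`) are assumptions made here for illustration. The sketch enumerates the 40 crossed conditions, draws two abilities with a chosen correlation, and computes bias, standard deviation, and RMSE across replications.

```python
import itertools

import numpy as np

# Factor levels as listed in the abstract.
CORRELATIONS = [0.90, 0.75, 0.50, 0.25, 0.00]
CALIBRATIONS = ["concurrent", "separate"]
TRANSFORMATIONS = ["Stocking-Lord", "Haebara"]
EQUATINGS = ["IRT true score", "IRT observed score"]


def condition_grid():
    """Return every crossed condition as a (rho, calibration, transformation, equating) tuple."""
    return list(
        itertools.product(CORRELATIONS, CALIBRATIONS, TRANSFORMATIONS, EQUATINGS)
    )


def draw_abilities(rho, n_examinees, rng):
    """Draw (theta1, theta2) pairs with correlation rho.

    Assumption: both abilities are standard normal (mean 0, variance 1).
    """
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_examinees)


def equating_error_summary(estimates, criterion):
    """Compute bias, SD, and RMSE at each raw-score point.

    estimates: array of shape (n_replications, n_score_points)
    criterion: array of shape (n_score_points,), the reference equated scores
    """
    estimates = np.asarray(estimates, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    bias = estimates.mean(axis=0) - criterion  # systematic error
    sd = estimates.std(axis=0)  # random error (ddof=0 is an assumption)
    rmse = np.sqrt(((estimates - criterion) ** 2).mean(axis=0))  # total error
    return bias, sd, rmse
```

With the population standard deviation, the three criteria satisfy RMSE² = bias² + SD² at every score point. This is why a method can show the least random error (the smallest SD) and still have a higher RMSE when its bias is larger.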
