Fen Başarısındaki Artışın Belirlenmesinde Madde Tepki Kuramına Dayalı Dikey Ölçekleme Yöntemlerinin Karşılaştırılması

A Comparison of IRT Vertical Scaling Methods in Determining the Increase in Science Achievement

This study carried out a vertical scaling based on Item Response Theory (IRT) and compared the vertical scaling results obtained with different calibration methods and proficiency estimation methods. The resulting vertical scales were evaluated against three criteria: grade-to-grade growth, grade-to-grade variability, and the separation of grade distributions. The data were collected from a total of 1500 students in twelve primary schools of differing socioeconomic backgrounds in the province of Ankara. A comparison of the findings for the first and second sub-problems shows that the mean differences obtained with separate calibration were lower than those obtained with concurrent calibration, that the standard deviations obtained with separate calibration were generally lower than those obtained with concurrent calibration, and that the effect sizes obtained with separate calibration were likewise lower. With concurrent calibration, the results for all three criteria were ordered ML < MAP < EAP (maximum likelihood, maximum a posteriori, and expected a posteriori proficiency estimation), with ML yielding the smallest values and EAP the largest. With separate calibration, by contrast, the ordering of the results varied with the criterion applied.
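The three evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. It assumes that proficiency estimates for each grade are already on a common vertical scale. It also assumes that the separation of adjacent grade distributions is summarized by an effect size, computed as the mean difference divided by the square root of the average of the two grade variances. The abstract does not state the exact formula used in the study. The function name `vertical_scale_criteria`, the grade labels, and the sample values are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def vertical_scale_criteria(theta_by_grade):
    """Summarize a vertical scale from per-grade proficiency estimates.

    theta_by_grade maps a grade label to a list of proficiency estimates
    that are assumed to be on one common (vertical) scale.
    """
    grades = sorted(theta_by_grade)
    # Grade-level means feed the growth criterion.
    means = {g: mean(theta_by_grade[g]) for g in grades}
    # Grade-level standard deviations are the variability criterion.
    sds = {g: stdev(theta_by_grade[g]) for g in grades}

    adjacent = []
    for lower, upper in zip(grades, grades[1:]):
        # Grade-to-grade growth: difference between adjacent grade means.
        growth = means[upper] - means[lower]
        # Assumed effect size: growth over the root of the mean variance.
        pooled_sd = sqrt((sds[upper] ** 2 + sds[lower] ** 2) / 2)
        adjacent.append(
            {
                "grades": (lower, upper),
                "growth": growth,
                "effect_size": growth / pooled_sd,
            }
        )
    return means, sds, adjacent


# Hypothetical estimates for two adjacent grades (illustrative only).
example = {6: [-1.0, 0.0, 1.0], 7: [0.0, 1.0, 2.0]}
means, sds, adjacent = vertical_scale_criteria(example)
```

Running the same function on estimates from each calibration method and each proficiency estimator would give comparable summaries of growth, variability, and separation.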

