Comparison of Item Response Theory Scaling Methods with ROC Analysis

In this study, unidimensional item response theory models were evaluated under different scaling methods. The equating errors and the area under the curve (AUC) of four scaling methods (Stocking-Lord, Haebara, Mean-Sigma, Mean-Mean) were examined for the one-, two-, and three-parameter logistic models (1PL, 2PL, and 3PL) in the non-equivalent groups with anchor test (NEAT) design. The equating errors of the scaling methods were then compared with the results obtained from ROC analysis. Qatar's and Australia's PISA 2012 mathematical literacy test data were used in the study. The minimum error was obtained from the Mean-Mean method with the 1PL model, and the maximum error was obtained from the Mean-Mean method with the 3PL model. The equating-error and ROC results were similar across all comparisons and supported each other. It is concluded that ROC analysis can be used to compare different conditions, methods, and models.
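To make the two moment-based scaling methods named above concrete, the following is a minimal sketch of how Mean-Sigma and Mean-Mean linking coefficients are commonly computed from anchor-item parameters, using the standard linear transformation θ* = Aθ + B. The function names and the anchor-item values are hypothetical and chosen only for illustration; they are not the PISA 2012 data or the code used in the study.

```python
from statistics import mean, pstdev


def mean_sigma(b_from, b_to):
    """Mean-Sigma coefficients (A, B) from anchor-item difficulties.

    b_from: difficulties on the scale to be transformed
    b_to:   difficulties of the same anchor items on the base scale
    """
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def mean_mean(a_from, a_to, b_from, b_to):
    """Mean-Mean coefficients (A, B) from anchor-item discriminations and difficulties."""
    A = mean(a_from) / mean(a_to)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def rescale(a, b, A, B):
    """Place item parameters on the base scale: a* = a / A, b* = A * b + B."""
    return [ai / A for ai in a], [A * bi + B for bi in b]


# Hypothetical anchor-item parameters (illustrative values, not real data).
a_from = [1.0, 1.2, 0.8, 1.5]
b_from = [-1.0, 0.0, 1.0, 2.0]
# Base-scale values constructed so that the true transformation is A = 2, B = 0.5.
a_to = [0.5, 0.6, 0.4, 0.75]
b_to = [-1.5, 0.5, 2.5, 4.5]

A_ms, B_ms = mean_sigma(b_from, b_to)
A_mm, B_mm = mean_mean(a_from, a_to, b_from, b_to)
a_star, b_star = rescale(a_from, b_from, A_ms, B_ms)
```

Because the illustrative parameters differ only by an exact linear transformation, both methods recover the same coefficients here. With estimated parameters the two methods generally disagree, which is one reason their equating errors can differ across models. Under the 1PL model all discriminations are equal, so Mean-Mean reduces to A = 1 with only a shift in B.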
