Item parameter recovery via traditional 2PL, Testlet and Bi-factor models for Testlet-Based tests

A testlet is a set of items based on a common stimulus. When a test includes testlets, the local independence assumption may be violated, in which case traditional item response theory models are not appropriate for that test. One of the most frequently used models for testlets is the testlet response theory (TRT) model; the bi-factor model and the traditional 2PL model are also applied to testlet-based tests. This study examines how well these three calibration models recover item parameters from data generated under different conditions and compares the models' performance. Data were generated by varying three factors: sample size (500, 1000, and 2000), testlet variance (.25, .50, and 1), and testlet size (4 and 10). Test length was fixed at 40 items, and 100 replications were run under each condition. The TRT model gave less biased results than the other two models, but the results of the bi-factor and TRT models became more similar as sample size increased. Among the factors examined, sample size had the greatest effect on parameter recovery.
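The simulation design can be illustrated with a minimal Python sketch. It is not the study's code: it assumes the three factors are fully crossed (18 cells), that all 40 items belong to equal-sized testlets, and that responses follow the two-parameter testlet model in its usual form, P(y = 1) = logistic(a(θ − b − γ)), with γ drawn from a normal distribution whose variance is the testlet variance. The generating distributions for a, b, and θ, and all function names, are hypothetical choices made for illustration.

```python
import itertools

import numpy as np

# Design factors taken from the abstract.
SAMPLE_SIZES = (500, 1000, 2000)
TESTLET_VARIANCES = (0.25, 0.50, 1.0)
TESTLET_SIZES = (4, 10)
N_ITEMS = 40
N_REPLICATIONS = 100

# Assumption: the factors are fully crossed, giving 3 x 3 x 2 = 18 cells.
conditions = list(
    itertools.product(SAMPLE_SIZES, TESTLET_VARIANCES, TESTLET_SIZES)
)


def simulate_testlet_2pl(n_persons, testlet_var, testlet_size,
                         n_items=N_ITEMS, rng=None):
    """Generate dichotomous responses under a 2PL testlet model.

    Assumption: every item belongs to exactly one testlet and all
    testlets have the same size (40 items -> 10 testlets of 4, or
    4 testlets of 10).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_testlets = n_items // testlet_size
    testlet_of_item = np.repeat(np.arange(n_testlets), testlet_size)

    # Hypothetical generating distributions (not stated in the abstract).
    a = rng.lognormal(mean=0.0, sigma=0.3, size=n_items)  # discrimination
    b = rng.normal(0.0, 1.0, size=n_items)                # difficulty
    theta = rng.normal(0.0, 1.0, size=n_persons)          # ability

    # Person-by-testlet random effects with the target variance.
    gamma = rng.normal(0.0, np.sqrt(testlet_var),
                       size=(n_persons, n_testlets))

    # logit = a_j * (theta_i - b_j - gamma_{i, d(j)})
    logit = a * (theta[:, None] - b - gamma[:, testlet_of_item])
    prob = 1.0 / (1.0 + np.exp(-logit))
    responses = (rng.random((n_persons, n_items)) < prob).astype(int)
    return responses, a, b


def bias(estimates, true_values):
    """Mean signed difference between estimated and true parameters."""
    return float(np.mean(np.asarray(estimates) - np.asarray(true_values)))


def rmse(estimates, true_values):
    """Root mean squared error between estimated and true parameters."""
    diff = np.asarray(estimates) - np.asarray(true_values)
    return float(np.sqrt(np.mean(diff ** 2)))
```

In a full study, each of the 18 cells would be simulated N_REPLICATIONS times, each data set calibrated with the 2PL, TRT, and bi-factor models, and bias and RMSE averaged over replications. The calibration step is omitted here because it depends on the estimation software used.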
