The Impact of Ignoring Multilevel Data Structure on the Estimation of Dichotomous Item Response Theory Models

Applying single-level statistical models to multilevel data typically produces underestimated standard errors, which may lead to misleading conclusions. This study examined the impact of ignoring multilevel data structure on the estimation of item parameters and their standard errors for the Rasch, two-parameter, and three-parameter logistic models in item response theory (IRT), in order to quantify the degree of such underestimation. In addition, Lord’s chi-square test was applied with the underestimated standard errors to test for differential item functioning (DIF), showing how the underestimation affects practical applications of IRT. Simulation results showed that, in the most severe multilevel condition, the standard error estimates from the standard single-level IRT models were about half of the minimal asymptotic standard error, and the Type I error rate of Lord’s chi-square test was inflated to as much as .35. These results suggest that standard single-level IRT models may lead to seriously misleading conclusions in the presence of multilevel data, and that multilevel IRT models should therefore be considered as alternatives.
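
The link between halved standard errors and an inflated Type I error rate can be illustrated with a minimal sketch. It is not the simulation design used in the study. It assumes the single-parameter (Rasch difficulty) case, in which Lord's chi-square reduces to the squared difference between the two group estimates divided by the sum of their sampling variances, with one degree of freedom. It further assumes that the estimates are approximately normal and that both groups' standard errors are understated by the same factor. The function names and the `se_ratio` parameter are hypothetical, introduced only for illustration.

```python
import math

# 0.95 quantile of the chi-square distribution with 1 degree of freedom
CRIT_CHI2_DF1_05 = 3.841458820694124


def lord_chi2_1p(b_ref, b_focal, se_ref, se_focal):
    """Lord's chi-square for a single item parameter (e.g. Rasch difficulty).

    Squared difference between the reference- and focal-group estimates,
    divided by the sum of their sampling variances (1 degree of freedom).
    """
    return (b_ref - b_focal) ** 2 / (se_ref ** 2 + se_focal ** 2)


def type1_rate(se_ratio, crit=CRIT_CHI2_DF1_05):
    """Actual rejection rate under no DIF when reported SE = se_ratio * true SE.

    Assumption (illustrative only): under the null hypothesis the difference
    divided by its true standard error is standard normal, so the computed
    statistic equals z**2 / se_ratio**2 and is rejected when
    |z| > se_ratio * sqrt(crit).
    """
    z_cut = se_ratio * math.sqrt(crit)
    # Two-sided tail probability of a standard normal: P(|Z| > z_cut)
    return math.erfc(z_cut / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical estimates: same difference, correct versus halved SEs
    print(lord_chi2_1p(0.3, 0.0, 0.10, 0.10))
    print(lord_chi2_1p(0.3, 0.0, 0.05, 0.05))
    # Nominal rate with correct SEs, then with SEs at half their true size
    print(type1_rate(1.0))
    print(type1_rate(0.5))
```

Under these assumptions, halving both standard errors quadruples the test statistic, and the rejection rate under the null hypothesis rises from the nominal .05 to roughly one in three, the same order of magnitude as the inflation reported for the most severe condition.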
