Impact of Retrofitting and Item Ordering on DIF

Richer diagnostic information about examinees’ cognitive strengths and weaknesses is obtained from cognitively diagnostic assessments (CDAs) when a proper cognitive diagnosis model (CDM) is used to analyze the response data. Researchers state that this requires a preset cognitive model specifying the underlying hypotheses about the structure of the response data. However, many real-data CDM applications are add-ons to simulation studies and are fitted to data obtained from non-CDAs. This procedure is referred to as retrofitting, and fitting CDMs to traditional test data is not uncommon. To address item/test bias, a major validity concern in CDAs, several DIF detection techniques compatible with various CDMs have recently been proposed. This study employs DIF detection techniques developed within the CTT, IRT, and CDM frameworks and compares their results to understand the extent to which the DIF flagging behavior of items is affected by retrofitting. A secondary purpose of this study is to gather evidence about test booklet effects (i.e., item ordering) on items’ psychometric properties through DIF analyses. Results indicated severe differences in DIF flagging prevalence across detection techniques employing the Wald test, Raju’s area measures, and Mantel-Haenszel statistics. The largest number of DIF cases was observed when the data were retrofitted to a CDM. The results further revealed that an item might be flagged as DIF in one booklet but not in another.
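As a rough illustration of two of the procedures named above, the sketch below shows how the Mantel-Haenszel and Raju’s area DIF checks might be run with the difR package cited in the references. The response matrix resp, the grouping vector gender, and the focal-group label "F" are hypothetical placeholders; the CDM-based Wald test is omitted because it additionally requires a Q-matrix and a fitted CDM.

library(difR)

# Hypothetical inputs: `resp` is an examinees-by-items matrix of 0/1 responses
# for one test booklet; `gender` is a vector of "M"/"F" codes of length
# nrow(resp), with "F" treated as the focal group.

# Mantel-Haenszel DIF test (CTT framework), matching on total score
mh_out <- difMH(Data = resp, group = gender, focal.name = "F")
print(mh_out)

# Raju's area DIF test (IRT framework), based on 2PL item parameter estimates
raju_out <- difRaju(Data = resp, group = gender, focal.name = "F", model = "2PL")
print(raju_out)

# The CDM-based Wald test requires a Q-matrix and a fitted cognitive diagnosis
# model, so it is not sketched here.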

___

  • Ankara University (2011). Ankara Üniversitesi Eğitim Bilimleri Fakültesi’nin YGS Hakkında Görüşü [Ankara University Faculty of Educational Sciences’ opinion on the YGS]. Retrieved November 30, 2015, from https://dahilibellek.wordpress.com/2011/04/12/ankara-ebf-ygs/
  • Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Newbury Park, CA: Sage.
  • Chambers, J. M., Cleveland, W. S., Kleiner, B., & Tukey, P. A. (1983). Graphical methods for data analysis. Belmont, CA: Wadsworth International Group.
  • de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: The Guilford Press.
  • de la Torre, J. (2009). DINA model and parameter estimation: A didactic. Journal of Educational and Behavioral Statistics, 34, 115-130.
  • de la Torre, J., Hong, Y., & Deng, W. (2010). Factors affecting the item parameter estimation and classification accuracy of the DINA model. Journal of Educational Measurement, 47, 227-249.
  • de la Torre, J., & Minchen, N. (2014). Cognitively diagnostic assessments and the cognitive diagnosis framework. Psicología Educativa, 20, 89-97.
  • DeCarlo, L. T. (2012). Recognizing uncertainty in the Q-matrix via a Bayesian extension of the DINA model. Applied Psychological Measurement, 36, 447-468.
  • Gierl, M. J., Alves, C., & Majeau, R. T. (2010). Using the attribute hierarchy method to make diagnostic inferences about examinees’ knowledge and skills in mathematics: An operational implementation of cognitive diagnostic assessment. International Journal of Testing, 10, 318-341. doi:10.1080/15305058.2010.509554
  • Gierl, M. J., & Cui, Y. (2008). Defining characteristics of diagnostic classification models and the problem of retrofitting in cognitive diagnostic assessment. Measurement, 6, 263-275.
  • Holland, P. W., & Thayer, D. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.
  • Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258-272.
  • Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1-73.
  • Kingston, N. M., & Dorans, N. J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8(2), 147-154.
  • Li, F. (2008). Modified higher-order DINA model for detecting differential item functioning and differential attribute functioning. Unpublished doctoral dissertation, University of Georgia, USA.
  • Magis, D., Béland, S., & Raîche, G. (2015). difR: Collection of methods to detect dichotomous differential item functioning (DIF). R package version 4.6.
  • Mantel, N., & Haenszel, W. M. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.
  • Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
  • ODTÜ (2011). 2011 Yılı Yükseköğretime Geçiş Sınavı Hakkında ODTÜ Eğitim Fakültesi Görüşü [METU Faculty of Education’s opinion on the 2011 Higher Education Transition Examination]. Retrieved November 30, 2015, from http://fedu.metu.edu.tr/sites/fedu.metu.edu.tr/files/ygs2011hkegitimfakultesigorusu_28_4_2011_v2.pdf
  • ÖSYM (2011). Adaya özgü soru kitapçığı [Examinee-specific question booklet]. Retrieved December 29, 2015, from http://www.osym.gov.tr/belge/1-12431/adaya-ozgu-soru-kitapcigi-21032011.html
  • R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  • Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.
  • Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207.
  • Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement, 6(4), 219-262.
  • Sessoms, J., & Henson, R. A. (2018). Applications of diagnostic classification models: A literature review and critical commentary. Measurement: Interdisciplinary Research and Perspectives, 16, 1-17.
  • Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223-233.