Gender-based Differential Item Functioning Analysis of the Medical Specialization Education Entrance Examination

The Medical Specialization Education Entrance Examination is a national high-stakes test used to place medical graduates into medical specialization training in Turkey. The purpose of this study is to determine whether the examination's items display gender-related differential item functioning (DIF) using the Mantel-Haenszel and logistic regression methods. To determine the presence of item bias, content experts reviewed the items. The analyses were conducted on the responses of 11,530 physicians to the Basic Medical Sciences and Clinical Medical Sciences tests of the 2017 spring-term Medical Specialization Education Entrance Examination. According to the Mantel-Haenszel method, eleven of the 234 items showed B-level gender-related DIF. Six of these items functioned in favor of male physicians and five in favor of female physicians. Since the numbers of items favoring each gender were nearly equal, DIF cancellation occurred. By content area, one histology and embryology item, one internal medicine item, and three gynecology and obstetrics items favored female physicians, while one physiology item, two medical pharmacology items, one pediatrics item, and two surgery items favored male physicians. According to the experts' reviews, none of the items were biased. The physicians' medical specialty preferences overlapped with the content areas of the items displaying DIF.
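The Mantel-Haenszel procedure and the ETS A/B/C severity levels named above can be sketched briefly. The following Python snippet is a minimal, hypothetical illustration (not the study's actual analysis): examinees are assumed to be already stratified by total test score, and for each stratum the 2×2 counts of group (reference/focal) by item response (correct/incorrect) are tallied. The common odds ratio is converted to the ETS delta metric, whose magnitude yields the A/B/C classification; the accompanying significance test is omitted for brevity.

```python
# Hedged sketch of Mantel-Haenszel DIF detection with invented toy counts.
# Per score stratum k: A_k = reference correct, B_k = reference incorrect,
#                      C_k = focal correct,     D_k = focal incorrect.
import math

def mh_delta(strata):
    """Return the ETS delta-MH statistic from a list of (A, B, C, D) tuples."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = num / den                 # MH common odds ratio across strata
    return -2.35 * math.log(alpha)   # transform to the ETS delta metric

def ets_level(delta):
    """Classify |delta-MH| into ETS levels A/B/C (significance test omitted)."""
    d = abs(delta)
    return "A" if d < 1.0 else ("B" if d < 1.5 else "C")

# Toy data: three score strata where the focal group answers slightly less often.
strata = [(40, 10, 33, 17), (35, 15, 29, 21), (30, 20, 26, 24)]
delta = mh_delta(strata)
print(round(delta, 2), ets_level(delta))  # prints: -1.19 B
```

A negative delta indicates DIF against the focal group; in the study, items with 1.0 ≤ |ΔMH| < 1.5 correspond to the B-level flags reported in the abstract.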

___

  • Akcan, R., & Atalay Kabasakal, K. (2019). An investigation of item bias of English test: The case of 2016 year undergraduate placement exam in Turkey. International Journal of Assessment Tools in Education, 6(1), 48-62. https://doi.org/10.21449/ijate.508581
  • Allalouf, A., Hambleton, R., & Sireci, S. (1999). Identifying the causes of translation DIF on verbal items. Journal of Educational Measurement, 36(3), 185-198. https://www.jstor.org/stable/1435153
  • American Educational Research Association. (2018). Standards for educational and psychological testing. American Educational Research Association.
  • Assessment, Selection and Placement Center [Ölçme Seçme ve Yerleştirme Merkezi, ÖSYM]. (2017). 2017 Tıpta Uzmanlık Eğitimi Giriş Sınavı başvuru kılavuzu. Retrieved from: https://dokuman.osym.gov.tr/pdfdokuman/2017/TUSILKBAHAR/BASVURUKILAVUZU26042017.pdf
  • Bakan Kalaycıoğlu, D. (2020). Changes in physicians’ specialization preferences from 1987 to 2017. Tıp Eğitimi Dünyası, 19(59), 157-170. https://doi.org/10.25282/ted.696179
  • Bakan Kalaycıoğlu, D., & Berberoğlu, G. (2011). Differential item functioning analysis of the science and mathematics items in the university entrance examinations in Turkey. Journal of Psychoeducational Assessment, 29(5), 467-478. https://doi.org/10.1177/0734282910391623
  • Berrío, Á. I., Gomez-Benito, J., & Arias-Patiño, E. M. (2020). Developments and trends in research on methods of detecting differential item functioning. Educational Research Review, 31, 100340. https://doi.org/10.1016/j.edurev.2020.100340
  • Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Sage.
  • Camilli, G., & Shepard, L.A. (1994). Methods for identifying biased test items. Sage Publications.
  • Clauser, B. E., Nungester, R. J., Mazor, K., & Ripkey, D. (1996a). A comparison of alternative matching strategies for DIF detection in tests that are multidimensional. Journal of Educational Measurement, 33(2), 202-214. https://doi.org/10.1111/j.1745-3984.1996.tb00489.x
  • Clauser, B. E., Nungester, R. J., & Swaminathan, H. (1996b). Improving the matching for DIF analysis by conditioning on both test score and an educational background variable. Journal of Educational Measurement, 33(4), 453-464. https://doi.org/10.1111/j.1745-3984.1996.tb00501.x
  • Crane, P. K., Belle, G. van, & Larson, E. B. (2004). Test bias in a cognitive test: Differential item functioning in the CASI. Statistics in Medicine, 23(2), 241-256. https://doi.org/10.1002/sim.1713
  • Çelik, M., & Özer Özkan, Y. (2020). Analysis of differential item functioning of PISA 2015 mathematics subtest subject to gender and statistical regions. Journal of Measurement and Evaluation in Education and Psychology, 11(3), 283-301. https://doi.org/10.21031/epod.715020
  • Çepni, Z., & Kelecioğlu, H. (2021). Detecting differential item functioning using SIBTEST, MH, LR and IRT methods. Journal of Measurement and Evaluation in Education and Psychology, 12(3), 267-285. https://doi.org/10.21031/epod.988879
  • Diaz, E., Brooks, G., & Johanson, G. (2021). Detecting differential item functioning: Item Response Theory methods versus the Mantel-Haenszel procedure. International Journal of Assessment Tools in Education, 8(2), 376-393. https://doi.org/10.21449/ijate.730141
  • Dorans, N. J., & Holland, P. W. (1992). DIF detection and description: Mantel-Haenszel and standardization (Research Report 92-10). Educational Testing Service.
  • Downing, S. M. (2002). Threats to the validity of locally developed multiple-choice tests in medical education: Construct-irrelevant variance and construct underrepresentation. Advances in Health Sciences Education, 7(3), 235-241. https://doi.org/10.1023/A:1021112514626
  • Downing, S. M., & Yudkowsky, R. (2009). Introduction to assessment in the health professions. In Assessment in health professions education (pp. 21-40). Routledge.
  • Edelen, M. O., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006). Identification of differential item functioning using item response theory and the likelihood-based model comparison approach: Application to the Mini-Mental State Examination. Medical Care, 44(11), S134-S142. https://doi.org/10.1097/01.mlr.0000245251.83359.8c
  • Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278-295. https://doi.org/10.1177/0146621605275728
  • Finch, W. H., & French, B. F. (2007). Detection of crossing differential item functioning: A comparison of four methods. Educational and Psychological Measurement, 67(4), 565-582. https://doi.org/10.1177/0013164406296975
  • Gomez-Benito, J., & Navas-Ara, M. J. (2000). A comparison of χ2, RFA and IRT based procedures in the detection of DIF. Quality and Quantity, 34(1), 17-31. https://doi.org/10.1023/A:1004703709442
  • Grover, R. K., & Ercikan, K. (2017). For which boys and which girls are reading assessment items biased against? Detection of differential item functioning in heterogeneous gender populations. Applied Measurement in Education, 30(3), 178-195. https://doi.org/10.1080/08957347.2017.1316276
  • Guilera, G., Gómez-Benito, J., Hidalgo, M. D., & Sánchez-Meca, J. (2013). Type I error and statistical power of the Mantel-Haenszel procedure for detecting DIF: A meta-analysis. Psychological Methods, 18(4), 553-571. https://doi.org/10.1037/a0034306
  • Güler, N., & Penfield, R. D. (2009). A comparison of the logistic regression and contingency table methods for simultaneous detection of uniform and nonuniform DIF. Journal of Educational Measurement, 46(3), 314-329. https://doi.org/10.1111/j.1745-3984.2009.00083.x
  • Hambleton, R. K. (2006). Good practices for identifying differential item functioning. Medical Care, 44(11), S182-S188. https://doi.org/10.1097/01.mlr.0000245443.86671.c4
  • Hidalgo, M. D., & Lopez-Pina, J. A. (2004). Differential item functioning detection and effect size: A comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement, 64(6), 903-915. https://doi.org/10.1177/0013164403261769
  • Holland, P. W., & Thayer, D. T. (1986, April 16-20). Differential item performance and the Mantel-Haenszel procedure [Paper presentation]. 67th Annual Meeting of the American Educational Research Association, San Francisco, CA.
  • Hope, D., Adamson, K., McManus, I. C., Chis, L., & Elder, A. (2018). Using differential item functioning to evaluate potential bias in a high stakes postgraduate knowledge based assessment. BMC Medical Education, 18, 64. https://doi.org/10.1186/s12909-018-1143-0
  • Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3(4), 424-453. https://doi.org/10.1037/1082-989X.3.4.424
  • Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349. https://doi.org/10.1207/S15324818AME1404_2
  • Jöreskog, K. G., & Sörbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Scientific Software International Inc.
  • Kelly, S., & Dennick, R. (2009). Evidence of gender bias in true-false-abstain medical examinations. BMC Medical Education, 9(1), 1-7. https://doi.org/10.1186/1472-6920-9-32
  • Khorramdel, L., Pokropek, A., Joo, S. H., Kirsch, I., & Halderman, L. (2020). Examining gender DIF and gender differences in the PISA 2018 reading literacy scale: A partial invariance approach. Psychological Test and Assessment Modeling, 62(2), 179-231.
  • Kıbrıslıoğlu Uysal, N., & Atalay Kabasakal, K. (2017). The effect of background variables on gender related differential item functioning. Journal of Measurement and Evaluation in Education and Psychology, 8(4), 373-390. https://doi.org/10.21031/epod.333451
  • MacIntosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 27(5), 372-379. https://doi.org/10.1177/0146621603256021
  • Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11(3), 320–341. https://doi.org/10.1207/s15328007sem1103_2
  • Muthen, B. O. (1988). Some uses of structural equation modeling in validity studies: Extending IRT to external variables. In H. Wainer & H. Braun (Eds.), Test validity (pp. 213-238). Lawrence Erlbaum.
  • Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20(3), 257-274. https://doi.org/10.1177/014662169602000306
  • Oort, F. J. (1992). Using restricted factor analysis to detect item bias. Methodika, 6(2), 150-166.
  • Schumacker, R. E., & Lomax, R. G. (2010). A beginner’s guide to structural equation modeling (3rd ed.). Taylor and Francis Group.
  • Shepard, L. A. (1982). Definitions of bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 9-30). Johns Hopkins University Press.
  • Sunderland, M., Mewton, L., Slade, T., & Baillie, A. J. (2010). Investigating differential symptom profiles in major depressive episode with and without generalized anxiety disorder: True co-morbidity or symptom similarity? Psychological Medicine, 40(7), 1113-1123. https://doi.org/10.1017/S0033291709991590
  • Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370. https://www.jstor.org/stable/1434855
  • Swanson, D. B., Clauser, B. E., Case, S. M., Nungester, R. J., & Featherman, C. (2002). Analysis of differential item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Behavioral Statistics, 27(1), 53-75. https://doi.org/10.3102/10769986027001053
  • Teresi, J. A. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44(11), S152-S170. https://doi.org/10.1097/01.mlr.0000245142.74628.ab
  • Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-114). Lawrence Erlbaum Associates.
  • Uğurlu, S., & Atar, B. (2020). Performances of MIMIC and logistic regression procedures in detecting DIF. Journal of Measurement and Evaluation in Education and Psychology, 11(1), 1-12. https://doi.org/10.21031/epod.531509
  • Wainer, H., & Sireci, S. G. (2005). Encyclopedia of social measurement. Elsevier.
  • Waller, N. G. (1998). EZDIF: Detection of uniform and nonuniform differential item functioning with the Mantel-Haenszel and logistic regression procedures. Applied Psychological Measurement, 22(4), 391. https://doi.org/10.1177/014662169802200409
  • Wyse, A. E. (2013). DIF cancellation in the Rasch model. Journal of Applied Measurement, 14(2), 118-128.
  • Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–347). Lawrence Erlbaum Associates.
  • Zumbo, B. D. (1999). A handbook on the theory and methods of Differential Item Functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Directorate of Human Resources Research and Evaluation, Department of National Defense.
  • Zumbo, B. D., & Gelin, M. N. (2005). A matter of test bias in educational policy research: Bringing the context into picture by investigating sociological/community moderated (or mediated) test and item bias. Journal of Educational Research & Policy Studies, 5(1), 1-23.