Comparability of the American and Turkish versions of the TIMSS mathematics test

This study examined the degree of comparability between two language versions of the 1999 Trends in International Mathematics and Science Study (TIMSS) mathematics test, administered in the United States of America and in Turkey. Measurement invariance between the two language versions was assessed using differential item functioning (DIF) analyses and exploratory factor analyses. The impact of the observed differences on score scale comparability was then examined by comparing test characteristic curves. Approximately 23% of the items were identified as functioning differentially between the two countries, and the factor analyses indicated differences in the structure of the two tests. However, the effect of these differences on score scale comparability was minimal, as demonstrated by very similar test characteristic curves for the two language versions.
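The item-level screening referred to above can be illustrated with a minimal sketch of the logistic regression DIF procedure (Swaminathan & Rogers, 1990), one common way of flagging differentially functioning items. The sketch assumes dichotomously scored items, and the names used (`logistic_dif`, `item`, `total`, `group`) are illustrative; it is not the study's actual analysis pipeline.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_dif(item, total, group):
    """Likelihood-ratio test for uniform and nonuniform DIF on one
    dichotomous item (Swaminathan & Rogers, 1990).

    item  : 0/1 responses to the studied item
    total : matching criterion (e.g., total or rest score on the test)
    group : 0 = reference group, 1 = focal group
    """
    # Reduced model: response predicted by the matching criterion only.
    reduced = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
    # Full model adds a group main effect (uniform DIF) and a
    # score-by-group interaction (nonuniform DIF).
    X = sm.add_constant(np.column_stack([total, group, total * group]))
    full = sm.Logit(item, X).fit(disp=0)
    # Likelihood-ratio statistic with 2 df for the two added terms.
    lr = 2 * (full.llf - reduced.llf)
    return lr, chi2.sf(lr, df=2)

# Illustrative use with simulated data containing a built-in group effect.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n).astype(float)
theta = rng.normal(size=n)
p = 1 / (1 + np.exp(-(theta - 0.5 * group)))  # uniform DIF against group 1
item = rng.binomial(1, p)
lr, pval = logistic_dif(item, theta, group)
print(f"LR = {lr:.2f}, p = {pval:.4f}")
```

A likelihood-ratio value above the chi-square critical value (about 5.99 at α = .05 with 2 degrees of freedom) would flag the item for review, typically alongside an effect-size criterion so that trivially small differences in large samples are not over-flagged.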
