Investigating the Effect of Rater Training on Differential Rater Functioning in Assessing the Academic Writing Skills of Higher Education Students

This study examined the effect of rater training on differential rater functioning (rater error) in the assessment of the academic writing skills of higher education students. The study employed a pre-test/post-test control group quasi-experimental design. The study group consisted of 45 raters: 22 in the experimental group and 23 in the control group. The raters were pre-service teachers who had not previously participated in any rater training, and it was verified that they had similar levels of assessment experience. Data were collected using an analytic rubric developed by the researchers and an opinion-based writing task prepared by the International English Language Testing System (IELTS). Within the scope of the research, the compositions of 39 students, written in a foreign language (English), were assessed. The data were analyzed with the many-facet Rasch model under a fully crossed design. The findings revealed that the rater training was effective in reducing differential rater functioning, and suggestions based on these results are presented.
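
The abstract does not report the exact parameterization used, so the following is only an illustrative sketch: a standard three-facet formulation of the many-facet Rasch model (after Linacre, 1994, listed below) gives the log-odds that the composition of student $n$ receives score category $k$ rather than $k-1$ on criterion $i$ from rater $j$ as

$$\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,$$

where $B_n$ is the ability of student $n$, $D_i$ the difficulty of rubric criterion $i$, $C_j$ the severity of rater $j$, and $F_k$ the threshold between score categories $k-1$ and $k$. Rater errors such as severity or leniency are read from the $C_j$ estimates, while differential rater functioning is detected through rater-by-facet interaction (bias) terms added to this baseline model. The fully crossed design, in which every rater scores every composition on every criterion, connects all facets on a common logit scale.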

___

  • Aryadoust, V. (2016). Understanding the growth of ESL paragraph writing skills and its relationships with linguistic features. Educational Psychology, 36(10), 1742-1770. https://doi.org/10.1080/01443410.2014.950946
  • Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a generic approach in automated essay scoring. Journal of Technology, Learning, and Assessment, 10(3), 1-16. Retrieved from https://ejournals.bc.edu/ojs/index.php/jtla/article/view/1603
  • Baştürk, M. (2012). İkinci dil öğrenme algılarının belirlenmesi: Balıkesir örneği [Determining perceptions of second language learning: The case of Balıkesir]. Balikesir University Journal of Social Sciences Institute, 15(28-1), 251-270. Retrieved from http://dspace.balikesir.edu.tr/xmlui/handle/20.500.12462/4594
  • Bayat, N. (2014). Öğretmen adaylarının eleştirel düşünme düzeyleri ile akademik yazma başarıları arasındaki ilişki [The relationship between pre-service teachers' levels of critical thinking and their academic writing achievement]. Eğitim ve Bilim, 39(173), 155-168. Retrieved from http://eb.ted.org.tr/index.php/EB/article/view/2333
  • Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: New response sets and decreasing accuracy. Journal of Applied Psychology, 65(1), 60-66. https://doi.org/10.1037/0021-9010.65.1.60
  • Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6(2), 205-212. Retrieved from https://journals.aom.org/doi/abs/10.5465/amr.1981.4287782
  • Bijani, H. (2018). Investigating the validity of oral assessment rater training program: A mixed-methods study of raters’ perceptions and attitudes before and after training. Cogent Education, 5(1), 1-20. https://doi.org/10.1080/2331186X.2018.1460901
  • Bitchener, J., Young, S., & Cameron, D. (2005). The effect of different types of corrective feedback on ESL students. Journal of Second Language Writing, 14, 191–205. https://doi.org/10.1016/j.jslw.2005.08.001
  • Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. New York and London: Routledge. https://doi.org/10.4324/9781315814698
  • Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of Work Keys listening and writing tests. Educational and Psychological Measurement, 55(2), 157-176. https://doi.org/10.1177/0013164495055002001
  • Brijmohan, A. (2016). A many-facet Rasch measurement analysis to explore rater effects and rater training in medical school admissions (Doctoral dissertation). Retrieved from http://www.proquest.com/
  • Brookhart, S. M. (2013). How to create and use rubrics for formative assessment and grading. Alexandria, Virginia: ASCD.
  • Brown, H. D. (2007). Teaching by principles: An interactive approach to language pedagogy. New York: Pearson Education.
  • Brown, J. D., & Hudson, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653-675. https://doi.org/10.2307/3587999
  • Burstein, J., Kukich, K., Wolff, S., Lu, C., Chodorow, M., Braden-Harder, L., & Harris, M. D. (1998). Automated scoring using a hybrid feature identification technique. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Quebec, Canada. https://doi.org/10.3115/980845.980879
  • Büyüköztürk, Ş. (2011). Deneysel desenler: Öntest-sontest kontrol grubu desen ve veri analizi [Experimental designs: Pretest-posttest control group design and data analysis]. Ankara: Pegem Akademi Yayıncılık.
  • Carter, C., Bishop, J. L., & Kravits, S. L. (2002). Keys to college studying: Becoming a lifelong learner. New Jersey: Prentice Hall.
  • Çekici, Y. E. (2018). Türkçe’nin yabancı dil olarak öğretiminde kullanılan ders kitaplarında yazma görevleri: Yedi İklim ve İstanbul üzerine karşılaştırmalı bir inceleme [Writing tasks in coursebooks used in teaching Turkish as a foreign language: A comparative examination of Yedi İklim and İstanbul]. Gaziantep Üniversitesi Eğitim Bilimleri Dergisi, 2(1), 1-10. Retrieved from http://dergipark.gov.tr/http-dergipark-gov-tr-journal-1517-dashboard/issue/36422/367409
  • Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289. https://doi.org/10.3102/10769986022003265
  • Congdon, P., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163-178. https://doi.org/10.1111/j.1745-3984.2000.tb01081.x
  • Cumming, A. (2013). Assessing integrated writing tasks for academic purposes: Promises and perils. Language Assessment Quarterly, 10(1), 1–8. https://doi.org/10.1080/15434303.2011.622016
  • Cumming, A. (2014). Assessing integrated skills. In A. Kunnan (Vol. Ed.), The companion to language assessment: Vol. 1 (pp. 216–229). Oxford, United Kingdom: Wiley-Blackwell. https://doi.org/10.1002/9781118411360.wbcla131
  • Dunbar, N. E., Brooks, C. F., & Miller, T. K. (2006). Oral communication skills in higher education: Using a performance-based evaluation rubric to assess communication skills. Innovative Higher Education, 31(2), 115-128. https://doi.org/10.1007/s10755-006-9012-x
  • Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement. New Jersey: Prentice Hall Press.
  • Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
  • Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt: Peter Lang.
  • Ellis, R. O. D., Johnson, K. E., & Papajohn, D. (2002). Concept mapping for rater training. TESOL Quarterly, 36(2), 219-233. https://doi.org/10.2307/3588333
  • Engelhard Jr, G., & Myford, C. M. (2003). Monitoring faculty consultant performance in the Advanced Placement English Literature and Composition Program with a many-faceted Rasch model. ETS Research Report Series, i-60. https://doi.org/10.1002/j.2333-8504.2003.tb01893.x
  • Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Development, implementation, and analysis (pp. 261-287). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Esfandiari, R. (2015). Rater errors among peer-assessors: Applying the many-facet Rasch measurement model. Iranian Journal of Applied Linguistics, 18(2), 77-107. https://doi.org/10.18869/acadpub.ijal.18.2.77
  • Fahim, M., & Bijani, H. (2011). The effects of rater training on raters’ severity and bias in second language writing assessment. Iranian Journal of Language Testing, 1(1), 1-16. Retrieved from http://www.ijlt.ir/portal/files/401-2011-01-01.pdf
  • Farrokhi, F., Esfandiari, R., & Schaefer, E. (2012). A many-facet Rasch measurement of differential rater severity/leniency in three types of assessment. JALT Journal, 34(1), 79-101. Retrieved from https://jalt-publications.org/files/pdf-article/jj2012a-art4.pdf
  • Farrokhi, F., Esfandiari, R., & Vaez Dalili, M. (2011). Applying the many-facet Rasch model to detect centrality in self-assessment, peer-assessment and teacher assessment. World Applied Sciences Journal, 15(11), 76-83. Retrieved from https://pdfs.semanticscholar.org/dd21/ba5683dde8b616374876b0c53da376c10ca9.pdf
  • Feldman, M., Lazzara, E. H., Vanderbilt, A. A., & DiazGranados, D. (2012). Rater training to support high-stakes simulation-based assessments. Journal of Continuing Education in the Health Professions, 32(4), 279-286. https://doi.org/10.1002/chp.21156
  • Gillet, A., Hammond, A., & Martala, M. (2009). Successful academic writing. New York: Pearson Longman.
  • Göçer, A. (2010). Türkçe öğretiminde yazma eğitimi [Writing education in Turkish teaching]. Uluslararası Sosyal Araştırmalar Dergisi, 3(12), 178-195. Retrieved from http://www.sosyalarastirmalar.com/cilt3/sayi12pdf/gocer_ali.pdf
  • Goodrich, H. (1997). Understanding rubrics: The dictionary may define "rubric," but these models provide more clarity. Educational Leadership, 54(4), 14-17.
  • Gronlund, N. E. (1977). Constructing achievement tests. New Jersey: Prentice-Hall Press.
  • Guadagnoli, E., & Velicer, W. F. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103(2), 265-275. https://doi.org/10.1037/0033-2909.103.2.265
  • Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. USA: Allyn & Bacon.
  • Hauenstein, N. M., & McCusker, M. E. (2017). Rater training: Understanding effects of training content, practice ratings, and feedback. International Journal of Selection and Assessment, 25(3), 253-266. https://doi.org/10.1111/ijsa.12177
  • Howitt, D., & Cramer, D. (2008). Introduction to statistics in psychology. Harlow: Pearson Education.
  • Hughes, A. (2003). Testing for language teachers. Cambridge: Cambridge University Press.
  • IELTS (n.d.). Prepare for IELTS. Retrieved from https://takeielts.britishcouncil.org/prepare-test/free-sample-tests/writing-sample-test-1-academic/writing-task-2
  • İlhan, M. (2015). Standart ve SOLO taksonomisine dayalı rubrikler ile puanlanan açık uçlu matematik sorularında puanlayıcı etkilerinin çok yüzeyli Rasch modeli ile incelenmesi [An investigation of rater effects through the many-facet Rasch model in open-ended mathematics questions scored with standard and SOLO taxonomy-based rubrics] (Doctoral dissertation). Retrieved from https://tez.yok.gov.tr
  • İlhan, M., & Çetin, B. (2014). Performans değerlendirmeye karışan puanlayıcı etkilerini azaltmanın yollarından biri olarak puanlayıcı eğitimleri: Kuramsal bir analiz [Rater training as one way of reducing the rater effects that contaminate performance assessment: A theoretical analysis]. Journal of European Education, 4(2), 29-38. https://doi.org/10.18656/jee.77087
  • Jin, K. Y., & Wang, W. C. (2017). Assessment of differential rater functioning in latent classes with new mixture facets models. Multivariate Behavioral Research, 52(3), 391-402. https://doi.org/10.1080/00273171.2017.1299615
  • Johnson, R. L., Penny, J. A., & Gordon, B. (2008). Assessing performance: Designing, scoring, and validating performance tasks. New York: Guilford Press.
  • Kassim, N. L. A. (2007). Exploring rater judging behaviour using the many-facet Rasch model. Paper presented at the Second Biennial International Conference on Teaching and Learning of English in Asia: Exploring New Frontiers (TELiA2), Universiti Utara Malaysia. Retrieved from http://repo.uum.edu.my/3212/
  • Kassim, N. L. A. (2011). Judging behaviour and rater errors: An application of the many-facet Rasch model. GEMA Online Journal of Language Studies, 11(3), 179-197. Retrieved from http://ejournals.ukm.my/gema/article/view/49
  • Kim, Y., Park, I., & Kang, M. (2012). Examining rater effects of the TGMD-2 on children with intellectual disability. Adapted Physical Activity Quarterly, 29(4), 346-365. https://doi.org/10.1123/apaq.29.4.346
  • Kim, Y. K. (2009). Combining constructed response items and multiple choice items using a hierarchical rater model (Doctoral dissertation). Retrieved from http://www.proquest.com/
  • Kondo, Y. (2010). Examination of rater training effect and rater eligibility in L2 performance assessment. Journal of Pan-Pacific Association of Applied Linguistics, 14(2), 1-23. Retrieved from https://eric.ed.gov/?id=EJ920513
  • Kubiszyn, T., & Borich, G. (2013). Educational testing and measurement. New Jersey: John Wiley & Sons Incorporated.
  • Kutlu, Ö., Doğan, C. D., & Karakaya, İ. (2014). Öğrenci başarısının belirlenmesi: Performansa ve portfolyoya dayalı durum belirleme [Determining student achievement: Performance- and portfolio-based assessment]. Ankara: Pegem Akademi Yayıncılık.
  • Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563-575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
  • Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284. Retrieved from https://www.rasch.org/rmt/rmt71h.htm
  • Linacre, J. M. (1994). Many-facet Rasch measurement. Chicago: Mesa Press.
  • Linacre, J. M. (1996). Generalizability theory and many-facet Rasch measurement. Objective measurement: Theory into practice, 3, 85-98. Retrieved from https://files.eric.ed.gov/fulltext/ED364573.pdf
  • Linacre, J. M. (2017). A user’s guide to FACETS: Rasch-model computer programs. Chicago: MESA.
  • Liu, J., & Xie, L. (2014). Examining rater effects in a WDCT pragmatics test. Iranian Journal of Language Testing, 4(1), 50-65. Retrieved from https://cdn.ov2.com/content/ijlte_1_ov2_com/wp-content_138/uploads/2019/07/422-2014-4-1.pdf
  • Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. https://doi.org/10.1177/026553229501200104
  • Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345. https://doi.org/10.1207/s15324818ame0304_3
  • May, G. L. (2008). The effect of rater training on reducing social style bias in peer evaluation. Business Communication Quarterly, 71(3), 297-313. https://doi.org/10.1177/1080569908321431
  • McDonald, R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Erlbaum.
  • McNamara, T. F. (1996). Measuring second language performance. New York: Longman.
  • Moore, B. B. (2009). Consideration of rater effects and rater design via signal detection theory (Doctoral dissertation). Retrieved from http://www.proquest.com/
  • Moser, K., Kemter, V., Wachsmann, K., Köver, N. Z., & Soucek, R. (2016). Evaluating rater training with double-pretest one-posttest designs: an analysis of testing effects and the moderating role of rater self-efficacy. The International Journal of Human Resource Management, 1-23. https://doi.org/10.1080/09585192.2016.1254102
  • Moskal, B. M. (2000). Scoring rubrics: What, when and how? Practical Assessment, Research & Evaluation, 7(3). Retrieved from http://pareonline.net/htm/v7n3.htm
  • Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74(4), 619-624. https://doi.org/10.1037/0021-9010.74.4.619
  • Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422. Retrieved from http://psycnet.apa.org/record/2003-09517-007
  • Oosterhof, A. (2003). Developing and using classroom assessments. New Jersey: Merrill-Prentice Hall Press.
  • Osburn, H. G. (2000). Coefficient alpha and related internal consistency reliability coefficients. Psychological Methods, 5(3), 343-355. https://doi.org/10.1037/1082-989X.5.3.343
  • Romagnano, L. (2001). The myth of objectivity in mathematics assessment. Mathematics Teacher, 94(1), 31-37. Retrieved from http://peterliljedahl.com/wp-content/uploads/Myth-of-Objectivity2.pdf
  • Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465-493. https://doi.org/10.1177/0265532208094273
  • Selden, S., Sherrier, T., & Wooters, R. (2012). Experimental study comparing a traditional approach to performance appraisal training to a whole‐brain training method at CB Fleet Laboratories. Human Resource Development Quarterly, 23(1), 9-34. https://doi.org/10.1002/hrdq.21123
  • Shale, D. (1996). Essay reliability: Form and meaning. In E. White, W. Lutz, & S. Kamusikiri (Eds.), Assessment of writing: Politics, policies, practices (pp. 76–96). New York: Modern Language Association of America.
  • Stamoulis, D. T., & Hauenstein, N. M. A. (1993). Rater training and rating accuracy: Training for dimensional accuracy versus training for ratee differentiation. Journal of Applied Psychology, 78(6), 994-1003. https://doi.org/10.1037/0021-9010.78.6.994
  • Storch, N., & Tapper, J. (2009). The impact of an EAP course on postgraduate writing. Journal of English for Academic Purposes, 8, 207-223. https://doi.org/10.1016/j.jeap.2009.03.001
  • Sulsky, L. M., & Day, D. V. (1992). Frame-of-reference training and cognitive categorization: An empirical investigation of rater memory issues. Journal of Applied Psychology, 77(4), 501-510. https://doi.org/10.1037/0021-9010.77.4.501
  • Van Dyke, N. (2008). Self‐and peer‐assessment disparities in university ranking schemes. Higher Education in Europe, 33(2/3), 285-293. https://doi.org/10.1080/03797720802254114
  • Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287. https://doi.org/10.1177/026553229801500205
  • Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511732997
  • Wesolowski, B. C., Wind, S. A., & Engelhard Jr, G. (2015). Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning. Musicae Scientiae, 19(2), 147-170. https://doi.org/10.1177/1029864915589014
  • Wilson, F. R., Pan, W., & Schumsky, D. A. (2012). Recalculation of the critical values for Lawshe’s content validity ratio. Measurement and Evaluation in Counseling and Development, 45(3), 197-210. https://doi.org/10.1177/0748175612440286
  • Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79(5), 962-987. https://doi.org/10.1177/0013164419834613
  • Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67(3), 189-205. https://doi.org/10.1111/j.2044-8325.1994.tb00562.x
  • Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37. https://doi.org/10.1111/j.1745-3992.2012.00241.x
  • Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501-527. https://doi.org/10.1177/0265532214536171
  • Zedeck, S., & Cascio, W. F. (1982). Performance appraisal decisions as a function of rater training and purpose of the appraisal. Journal of Applied Psychology, 67(6), 752-758. https://doi.org/10.1037/0021-9010.67.6.752
  • Zwiers, J. (2008). Building academic language: Essential practices for content classrooms. San Francisco: Jossey-Bass.