Investigating the Effect of Rater Training on Differential Rater Functioning in Assessing the Academic Writing Skills of Higher Education Students

This study examined the effect of rater training on differential rater functioning (rater error) in the assessment of the academic writing skills of higher education students. The study employed a pre-test/post-test control group quasi-experimental design. The study group consisted of 45 raters: 22 in the experimental group and 23 in the control group. The raters were pre-service teachers who had not previously participated in any rater training, and it was verified that they had similar levels of assessment experience. Data were collected using an analytic rubric developed by the researchers and an opinion-based writing task prepared by the International English Language Testing System (IELTS). Within the scope of the research, the compositions of 39 students, written in a foreign language (English), were assessed. The data were analyzed with the many-facet Rasch model under a fully crossed design. The findings revealed that the rater training was effective in reducing differential rater functioning, and suggestions based on these results are presented.
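
The abstract does not report the exact parameterization used, so the following is only an illustrative sketch: a standard three-facet formulation of the many-facet Rasch model (after Linacre, 1994, listed below) gives the log-odds that the composition of student $n$ receives score category $k$ rather than $k-1$ on criterion $i$ from rater $j$ as

$$\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,$$

where $B_n$ is the ability of student $n$, $D_i$ the difficulty of rubric criterion $i$, $C_j$ the severity of rater $j$, and $F_k$ the threshold between score categories $k-1$ and $k$. Rater errors such as severity or leniency are read from the $C_j$ estimates, while differential rater functioning is detected through rater-by-facet interaction (bias) terms added to this baseline model. The fully crossed design, in which every rater scores every composition on every criterion, connects all facets on a common logit scale.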

___

  • Aryadoust, V. (2016). Understanding the growth of ESL paragraph writing skills and its relationships with linguistic features. Educational Psychology, 36(10), 1742-1770. https://doi.org/10.1080/01443410.2014.950946
  • Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a generic approach in automated essay scoring. Journal of Technology, Learning, and Assessment, 10(3), 1-16. Retrieved from https://ejournals.bc.edu/ojs/index.php/jtla/article/view/1603
  • Baştürk, M. (2012). İkinci dil öğrenme algılarının belirlenmesi: Balıkesir örneği [Determining perceptions of second language learning: The case of Balıkesir]. Balikesir University Journal of Social Sciences Institute, 15(28-1), 251-270. Retrieved from http://dspace.balikesir.edu.tr/xmlui/handle/20.500.12462/4594
  • Bayat, N. (2014). Öğretmen adaylarının eleştirel düşünme düzeyleri ile akademik yazma başarıları arasındaki ilişki [The relationship between pre-service teachers' levels of critical thinking and their academic writing achievement]. Eğitim ve Bilim, 39(173), 155-168. Retrieved from http://eb.ted.org.tr/index.php/EB/article/view/2333
  • Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: New response sets and decreasing accuracy. Journal of Applied Psychology, 65(1), 60-66. https://doi.org/10.1037/0021-9010.65.1.60
  • Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6(2), 205-212. Retrieved from https://journals.aom.org/doi/abs/10.5465/amr.1981.4287782
  • Bijani, H. (2018). Investigating the validity of oral assessment rater training program: A mixed-methods study of raters’ perceptions and attitudes before and after training. Cogent Education, 5(1), 1-20. https://doi.org/10.1080/2331186X.2018.1460901
  • Bitchener, J., Young, S., & Cameron, D. (2005). The effect of different types of corrective feedback on ESL students. Journal of Second Language Writing, 14, 191–205. https://doi.org/10.1016/j.jslw.2005.08.001
  • Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. New York and London: Routledge. https://doi.org/10.4324/9781315814698
  • Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of Work Keys listening and writing tests. Educational and Psychological Measurement, 55(2), 157-176. https://doi.org/10.1177/0013164495055002001
  • Brijmohan, A. (2016). A many-facet Rasch measurement analysis to explore rater effects and rater training in medical school admissions (Doctoral dissertation). Retrieved from http://www.proquest.com/
  • Brookhart, S. M. (2013). How to create and use rubrics for formative assessment and grading. Alexandria, Virginia: ASCD.
  • Brown, H. D. (2007). Teaching by principles: An interactive approach to language pedagogy. New York: Pearson Education.
  • Brown, J. D., & Hudson, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653-675. https://doi.org/10.2307/3587999
  • Burstein, J., Kukich, K., Wolff, S., Lu, C., Chodorow, M., Braden-Harder, L., & Harris, M. D. (1998). Automated scoring using a hybrid feature identification technique. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Quebec, Canada. https://doi.org/10.3115/980845.980879
  • Büyüköztürk, Ş. (2011). Deneysel desenler: Öntest-sontest kontrol grubu desen ve veri analizi [Experimental designs: Pretest-posttest control group design and data analysis]. Ankara: Pegem Akademi Yayıncılık.
  • Carter, C., Bishop, J. L., & Kravits, S. L. (2002). Keys to college studying: Becoming a lifelong learner. New Jersey: Prentice Hall.
  • Çekici, Y. E. (2018). Türkçe’nin yabancı dil olarak öğretiminde kullanılan ders kitaplarında yazma görevleri: Yedi İklim ve İstanbul üzerine karşılaştırmalı bir inceleme [Writing tasks in coursebooks used in teaching Turkish as a foreign language: A comparative examination of Yedi İklim and İstanbul]. Gaziantep Üniversitesi Eğitim Bilimleri Dergisi, 2(1), 1-10. Retrieved from http://dergipark.gov.tr/http-dergipark-gov-tr-journal-1517-dashboard/issue/36422/367409
  • Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289. https://doi.org/10.3102/10769986022003265
  • Congdon, P., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163-178. https://doi.org/10.1111/j.1745-3984.2000.tb01081.x
  • Cumming, A. (2013). Assessing integrated writing tasks for academic purposes: Promises and perils. Language Assessment Quarterly, 10(1), 1–8. https://doi.org/10.1080/15434303.2011.622016
  • Cumming, A. (2014). Assessing integrated skills. In A. Kunnan (Vol. Ed.), The companion to language assessment: Vol. 1 (pp. 216–229). Oxford, United Kingdom: Wiley-Blackwell. https://doi.org/10.1002/9781118411360.wbcla131
  • Dunbar, N. E., Brooks, C. F., & Miller, T. K. (2006). Oral communication skills in higher education: Using a performance-based evaluation rubric to assess communication skills. Innovative Higher Education, 31(2), 115-128. https://doi.org/10.1007/s10755-006-9012-x
  • Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement. New Jersey: Prentice Hall Press.
  • Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
  • Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt: Peter Lang.
  • Ellis, R. O. D., Johnson, K. E., & Papajohn, D. (2002). Concept mapping for rater training. TESOL Quarterly, 36(2), 219-233. https://doi.org/10.2307/3588333
  • Engelhard Jr, G., & Myford, C. M. (2003). Monitoring faculty consultant performance in the Advanced Placement English Literature and Composition Program with a many-faceted Rasch model. ETS Research Report Series, i-60. https://doi.org/10.1002/j.2333-8504.2003.tb01893.x
  • Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Development, implementation, and analysis (pp. 261-287). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Esfandiari, R. (2015). Rater errors among peer-assessors: Applying the many-facet Rasch measurement model. Iranian Journal of Applied Linguistics, 18(2), 77-107. https://doi.org/10.18869/acadpub.ijal.18.2.77
  • Fahim, M., & Bijani, H. (2011). The effects of rater training on raters’ severity and bias in second language writing assessment. Iranian Journal of Language Testing, 1(1), 1-16. Retrieved from http://www.ijlt.ir/portal/files/401-2011-01-01.pdf
  • Farrokhi, F., Esfandiari, R., & Schaefer, E. (2012). A many-facet Rasch measurement of differential rater severity/leniency in three types of assessment. JALT Journal, 34(1), 79-101. Retrieved from https://jalt-publications.org/files/pdf-article/jj2012a-art4.pdf
  • Farrokhi, F., Esfandiari, R., & Vaez Dalili, M. (2011). Applying the many-facet Rasch model to detect centrality in self-assessment, peer-assessment and teacher assessment. World Applied Sciences Journal, 15(11), 76-83. Retrieved from https://pdfs.semanticscholar.org/dd21/ba5683dde8b616374876b0c53da376c10ca9.pdf
  • Feldman, M., Lazzara, E. H., Vanderbilt, A. A., & DiazGranados, D. (2012). Rater training to support high-stakes simulation-based assessments. Journal of Continuing Education in the Health Professions, 32(4), 279-286. https://doi.org/10.1002/chp.21156
  • Gillet, A., Hammond, A., & Martala, M. (2009). Successful academic writing. New York: Pearson Longman.
  • Göçer, A. (2010). Türkçe öğretiminde yazma eğitimi [Writing education in Turkish teaching]. Uluslararası Sosyal Araştırmalar Dergisi, 3(12), 178-195. Retrieved from http://www.sosyalarastirmalar.com/cilt3/sayi12pdf/gocer_ali.pdf
  • Goodrich, H. (1997). Understanding rubrics: The dictionary may define "rubric," but these models provide more clarity. Educational Leadership, 54(4), 14-17.
  • Gronlund, N. E. (1977). Constructing achievement tests. New Jersey: Prentice-Hall Press.
  • Guadagnoli, E., & Velicer, W. F. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103(2), 265-275. https://doi.org/10.1037/0033-2909.103.2.265
  • Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. USA: Allyn & Bacon.
  • Hauenstein, N. M., & McCusker, M. E. (2017). Rater training: Understanding effects of training content, practice ratings, and feedback. International Journal of Selection and Assessment, 25(3), 253-266. https://doi.org/10.1111/ijsa.12177
  • Howitt, D., & Cramer, D. (2008). Introduction to statistics in psychology. Harlow: Pearson Education.
  • Hughes, A. (2003). Testing for language teachers. Cambridge: Cambridge University Press.
  • IELTS (n.d.). Prepare for IELTS. Retrieved from https://takeielts.britishcouncil.org/prepare-test/free-sample-tests/writing-sample-test-1-academic/writing-task-2
  • İlhan, M. (2015). Standart ve SOLO taksonomisine dayalı rubrikler ile puanlanan açık uçlu matematik sorularında puanlayıcı etkilerinin çok yüzeyli Rasch modeli ile incelenmesi [An investigation of rater effects through the many-facet Rasch model in open-ended mathematics questions scored with standard and SOLO taxonomy-based rubrics] (Doctoral dissertation). Retrieved from https://tez.yok.gov.tr
  • İlhan, M., & Çetin, B. (2014). Performans değerlendirmeye karışan puanlayıcı etkilerini azaltmanın yollarından biri olarak puanlayıcı eğitimleri: Kuramsal bir analiz [Rater training as one way of reducing the rater effects that contaminate performance assessment: A theoretical analysis]. Journal of European Education, 4(2), 29-38. https://doi.org/10.18656/jee.77087
  • Jin, K. Y., & Wang, W. C. (2017). Assessment of differential rater functioning in latent classes with new mixture facets models. Multivariate Behavioral Research, 52(3), 391-402. https://doi.org/10.1080/00273171.2017.1299615
  • Johnson, R. L., Penny, J. A., & Gordon, B. (2008). Assessing performance: Designing, scoring, and validating performance tasks. New York: Guilford Press.
  • Kassim, N. L. A. (2007). Exploring rater judging behaviour using the many-facet Rasch model. Paper presented at the Second Biennial International Conference on Teaching and Learning of English in Asia: Exploring New Frontiers (TELiA2), Universiti Utara Malaysia. Retrieved from http://repo.uum.edu.my/3212/
  • Kassim, N. L. A. (2011). Judging behaviour and rater errors: An application of the many-facet Rasch model. GEMA Online Journal of Language Studies, 11(3), 179-197. Retrieved from http://ejournals.ukm.my/gema/article/view/49
  • Kim, Y., Park, I., & Kang, M. (2012). Examining rater effects of the TGMD-2 on children with intellectual disability. Adapted Physical Activity Quarterly, 29(4), 346-365. https://doi.org/10.1123/apaq.29.4.346
  • Kim, Y. K. (2009). Combining constructed response items and multiple choice items using a hierarchical rater model (Doctoral dissertation). Retrieved from http://www.proquest.com/
  • Kondo, Y. (2010). Examination of rater training effect and rater eligibility in L2 performance assessment. Journal of Pan-Pacific Association of Applied Linguistics, 14(2), 1-23. Retrieved from https://eric.ed.gov/?id=EJ920513
  • Kubiszyn, T., & Borich, G. (2013). Educational testing and measurement. New Jersey: John Wiley & Sons Incorporated.
  • Kutlu, Ö., Doğan, C. D., & Karakaya, İ. (2014). Öğrenci başarısının belirlenmesi: Performansa ve portfolyoya dayalı durum belirleme [Determining student achievement: Performance- and portfolio-based assessment]. Ankara: Pegem Akademi Yayıncılık.
  • Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563-575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
  • Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284. Retrieved from https://www.rasch.org/rmt/rmt71h.htm
  • Linacre, J. M. (1994). Many-facet Rasch measurement. Chicago: Mesa Press.
  • Linacre, J. M. (1996). Generalizability theory and many-facet Rasch measurement. Objective measurement: Theory into practice, 3, 85-98. Retrieved from https://files.eric.ed.gov/fulltext/ED364573.pdf
  • Linacre, J. M. (2017). A user’s guide to FACETS: Rasch-model computer programs. Chicago: MESA.
  • Liu, J., & Xie, L. (2014). Examining rater effects in a WDCT pragmatics test. Iranian Journal of Language Testing, 4(1), 50-65. Retrieved from https://cdn.ov2.com/content/ijlte_1_ov2_com/wp-content_138/uploads/2019/07/422-2014-4-1.pdf
  • Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. https://doi.org/10.1177/026553229501200104
  • Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345. https://doi.org/10.1207/s15324818ame0304_3
  • May, G. L. (2008). The effect of rater training on reducing social style bias in peer evaluation. Business Communication Quarterly, 71(3), 297-313. https://doi.org/10.1177/1080569908321431
  • McDonald, R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Erlbaum.
  • McNamara, T. F. (1996). Measuring second language performance. New York: Longman.
  • Moore, B. B. (2009). Consideration of rater effects and rater design via signal detection theory (Doctoral dissertation). Retrieved from http://www.proquest.com/
  • Moser, K., Kemter, V., Wachsmann, K., Köver, N. Z., & Soucek, R. (2016). Evaluating rater training with double-pretest one-posttest designs: an analysis of testing effects and the moderating role of rater self-efficacy. The International Journal of Human Resource Management, 1-23. https://doi.org/10.1080/09585192.2016.1254102
  • Moskal, B. M. (2000). Scoring rubrics: What, when and how? Practical Assessment, Research & Evaluation, 7(3). Retrieved from http://pareonline.net/htm/v7n3.htm
  • Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74(4), 619-624. https://doi.org/10.1037/0021-9010.74.4.619
  • Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422. Retrieved from http://psycnet.apa.org/record/2003-09517-007
  • Oosterhof, A. (2003). Developing and using classroom assessments. New Jersey: Merrill-Prentice Hall Press.
  • Osburn, H. G. (2000). Coefficient alpha and related internal consistency reliability coefficients. Psychological Methods, 5(3), 343-355. https://doi.org/10.1037/1082-989X.5.3.343
  • Romagnano, L. (2001). The myth of objectivity in mathematics assessment. Mathematics Teacher, 94(1), 31-37. Retrieved from http://peterliljedahl.com/wp-content/uploads/Myth-of-Objectivity2.pdf
  • Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465-493. https://doi.org/10.1177/0265532208094273
  • Selden, S., Sherrier, T., & Wooters, R. (2012). Experimental study comparing a traditional approach to performance appraisal training to a whole‐brain training method at CB Fleet Laboratories. Human Resource Development Quarterly, 23(1), 9-34. https://doi.org/10.1002/hrdq.21123
  • Shale, D. (1996). Essay reliability: Form and meaning. In E. White, W. Lutz, & S. Kamusikiri (Eds.), Assessment of writing: Politics, policies, practices (pp. 76–96). New York: Modern Language Association of America.
  • Stamoulis, D. T., & Hauenstein, N. M. A. (1993). Rater training and rating accuracy: Training for dimensional accuracy versus training for ratee differentiation. Journal of Applied Psychology, 78(6), 994-1003. https://doi.org/10.1037/0021-9010.78.6.994
  • Storch, N., & Tapper, J. (2009). The impact of an EAP course on postgraduate writing. Journal of English for Academic Purposes, 8, 207-223. https://doi.org/10.1016/j.jeap.2009.03.001
  • Sulsky, L. M., & Day, D. V. (1992). Frame-of-reference training and cognitive categorization: An empirical investigation of rater memory issues. Journal of Applied Psychology, 77(4), 501-510. https://doi.org/10.1037/0021-9010.77.4.501
  • Van Dyke, N. (2008). Self‐and peer‐assessment disparities in university ranking schemes. Higher Education in Europe, 33(2/3), 285-293. https://doi.org/10.1080/03797720802254114
  • Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287. https://doi.org/10.1177/026553229801500205
  • Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511732997
  • Wesolowski, B. C., Wind, S. A., & Engelhard Jr, G. (2015). Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning. Musicae Scientiae, 19(2), 147-170. https://doi.org/10.1177/1029864915589014
  • Wilson, F. R., Pan, W., & Schumsky, D. A. (2012). Recalculation of the critical values for Lawshe’s content validity ratio. Measurement and Evaluation in Counseling and Development, 45(3), 197-210. https://doi.org/10.1177/0748175612440286
  • Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79(5), 962-987. https://doi.org/10.1177/0013164419834613
  • Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67(3), 189-205. https://doi.org/10.1111/j.2044-8325.1994.tb00562.x
  • Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37. https://doi.org/10.1111/j.1745-3992.2012.00241.x
  • Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501-527. https://doi.org/10.1177/0265532214536171
  • Zedeck, S., & Cascio, W. F. (1982). Performance appraisal decisions as a function of rater training and purpose of the appraisal. Journal of Applied Psychology, 67(6), 752-758. https://doi.org/10.1037/0021-9010.67.6.752
  • Zwiers, J. (2008). Building academic language: Essential practices for content classrooms. San Francisco: Jossey-Bass.