Measuring Essay Assessment: Intra-rater and Inter-rater Reliability

Problem Statement: There have been many attempts to investigate the effective assessment of writing ability, and many proposals for how this might be done. In this context, rater reliability plays a crucial role in making vital decisions about examinees at different turning points of both their educational and professional lives. The intra-rater and inter-rater reliability of essay scores assigned with different assessment tools should therefore be discussed together with the assessment process itself. Purpose of Study: The purpose of the study is to reveal possible variation or consistency in the grading of EFL writers' essays by the same and by different raters using general impression marking (GIM), an essay criteria checklist (ECC), and an essay assessment scale (ESAS), and to discuss rater reliability. Methods: Quantitative and qualitative data were used to support the discussion and implications concerning the reliability of the ratings and the consistency of the measurement results. The assessment tools were applied to essays written by 44 EFL university students, and 10 raters scored the students' essay writing ability using GIM, ECC, and ESAS on different occasions. Findings and Results: The findings and the results of the analyses indicated that, as expected, general impression marking is not a reliable way of assessing essays. Considering the correlation coefficients, estimated variance components, and generalizability coefficients obtained from the checklist and scale assessments, the scores are not identical and there is always variation among the results. Conclusions and Recommendations: When the total scores and the consistency among raters are examined, the scores are almost never identical; they differ from one another even when the correlation coefficients are high and significant. For this reason, contrary to the commonly held view, checklists and even scales may not be as reliable as expected, and they may not improve inter-rater or intra-rater reliability, unless the raters are well trained in the tools and reach strong agreement on the criteria, criterion descriptors, and performance indicators, so that the criteria are not open to ambiguous interpretation. The results might be more accurate and provide stronger evidence of reliable rating if the minimum correlation coefficient accepted as meaningful for this kind of measurement were set at .90. This would mean that the scores assigned to the same or to independent essays would be closer to one another and more similar results would emerge. Nevertheless, given the present findings, scale use can still be emphasized as more reliable than the other assessment tools. Even so, replicating the study with more raters is seen as a worthwhile contribution to the field.
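The abstract refers to correlation coefficients, estimated variance components, and generalizability coefficients without showing how they are obtained. The sketch below, in Python with NumPy, illustrates one standard way of computing pairwise inter-rater correlations and relative/absolute generalizability coefficients for a fully crossed persons-by-raters design such as the 44 students by 10 raters described above. It is a minimal sketch under that design assumption, not the analysis actually carried out in the study; the function names and the simulated scores are hypothetical.

import numpy as np
from itertools import combinations

def interrater_correlations(scores):
    # Pairwise Pearson correlations between raters.
    # scores: (n_persons, n_raters) array; rows = examinees, columns = raters.
    n_raters = scores.shape[1]
    return {(i, j): np.corrcoef(scores[:, i], scores[:, j])[0, 1]
            for i, j in combinations(range(n_raters), 2)}

def g_coefficients(scores):
    # Variance components and G coefficients for a fully crossed
    # persons x raters (p x r) design with one score per cell.
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition.
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Estimated variance components (negative estimates truncated at zero).
    var_pr = ms_pr                           # sigma^2(pr,e)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)   # sigma^2(r)
    var_p = max((ms_p - ms_pr) / n_r, 0.0)   # sigma^2(p)

    # Relative (E rho^2) and absolute (Phi) coefficients for n_r raters:
    #   E rho^2 = s2_p / (s2_p + s2_pr / n_r)
    #   Phi     = s2_p / (s2_p + (s2_r + s2_pr) / n_r)
    e_rho2 = var_p / (var_p + var_pr / n_r)
    phi = var_p / (var_p + (var_r + var_pr) / n_r)
    return {"var_p": var_p, "var_r": var_r, "var_pr_e": var_pr,
            "E_rho2": e_rho2, "Phi": phi}

# Hypothetical data: 44 examinees scored by 10 raters (illustration only).
rng = np.random.default_rng(0)
scores = rng.normal(70, 10, size=(44, 1)) + rng.normal(0, 5, size=(44, 10))
print(interrater_correlations(scores))
print(g_coefficients(scores))

Under this reading, the .90 criterion suggested in the conclusions would simply be applied to the resulting pairwise correlations (or, analogously, to the generalizability coefficients) before a set of ratings is accepted as evidence of reliable scoring.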
