Mehmet SATA, Ismail KARAKAYA, Aslihan ERMAN ASLANOGLU

An Examination of University Students' Rating Behaviors in the Self and Peer Rating Process Using the Many-Facet Rasch Model

Problem Statement: It is clear that the main aim of higher education has shifted toward supporting students in becoming reflective practitioners who think critically about their own professional practice and solve problems (Falchikov & Goldfinch, 2000; Kwan & Leung, 1996). Helping individuals acquire and develop these skills has also become a focus of curricula, which are therefore expected to monitor and assess them. The traditional measurement instruments used for this purpose fall short in measuring these characteristics. This new understanding also considers it important for students to take part in evaluating the learning process, which has brought new assessment approaches to the fore (Bushell, 2006; Dochy, 2001; Falchikov & Goldfinch, 2000). Self-assessment and peer assessment are regarded as important approaches for getting students to take responsibility for their own learning, and their use is recommended to encourage students' active participation in instruction. Using self and peer assessment in teaching provides an undeniable benefit, because as the number of assessors increases, more views of the student are obtained and it becomes possible to know the student from multiple perspectives. In other words, students receive multifaceted feedback on the quality of their work, to a greater extent than a single instructor could provide through traditional assessment methods (Millar, 2003). When self and peer assessment methods are used in teaching, the most important problem is considered to be the reliability of the scores obtained from these sources and the validity of the inferences based on them (Donnon, McIlwrick & Woloschuk, 2013). Rater-related factors that affect a student's performance are called rater behaviors (Farrokhi, Esfandiari & Vaez Dalili, 2011).
In this context, the problem addressed by the present study is which rater behaviors emerge in self and peer assessment.

Purpose of the Study: The purpose of this study is to determine, by means of the many-facet Rasch measurement model, which rater behaviors university students display in the self and peer rating process.

Method: Because the study aims to reveal the rater behaviors that teacher candidates displayed while rating the research proposals they had prepared, and thus describes an existing situation, it is a descriptive quantitative study. The participants were 58 volunteers from among the students taking the scientific research methods course in the Guidance and Psychological Counseling program of the faculty of education at a foundation university in Ankara in the 2017-2018 academic year. The data were collected with an analytic rubric (ADPA, from its Turkish name) developed by the researchers. The ADPA was developed to evaluate any scientific research proposal. Expert opinions were first obtained on a draft of the instrument, and the instrument was finalized in line with these opinions and suggestions. Accordingly, the criteria of the instrument were defined as identification of the problem, method, findings, and conclusion/interpretation. Each criterion of the ADPA was scored on a four-point scale (from "0", highly inadequate, to "3", highly adequate). Exploratory factor analysis (EFA) was used to examine the validity of the measurements obtained from the ADPA, and McDonald's ω coefficient was used for their reliability. The many-facet Rasch measurement model was used to analyze the data, and the analyses were carried out with the FACETS software package. The analysis rests on several assumptions.
Meeting these assumptions supports the validity of the inferences drawn from the analysis results. Unidimensionality was examined as the first assumption, and the measurement instrument was found to be unidimensional, as reported in the data collection instruments section. Because unidimensionality was treated as an indication that local independence was also met, no separate procedure was carried out for local independence. Finally, model-data fit was examined. For model-data fit, the number of standardized residuals outside the ±2 range should not exceed 5% of the total number of observations, and the number of standardized residuals outside the ±3 range should not exceed 1% of the total number of observations (Linacre, 2017). In this study, the total number of observations was 2784 (58 × 12 × 4); the number of standardized residuals outside the ±2 range was 116 (4.17%) and the number outside the ±3 range was 28 (1.01%), so model-data fit was considered to be achieved for the present study.
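
The arithmetic behind the model-data fit check can be reproduced with a short Python sketch. It is only an illustration of the percentages reported above, not the FACETS procedure itself; the function names are hypothetical, and reading 58 × 12 × 4 as raters × rated objects × criteria is an assumption.

```python
def total_observations(raters: int, rated_objects: int, criteria: int) -> int:
    """Number of ratings in a fully crossed rating design."""
    return raters * rated_objects * criteria


def residual_share(n_outside: int, n_total: int) -> float:
    """Percentage of standardized residuals that fall outside a given band."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_outside / n_total


# Counts taken from the text; the facet labels are an assumption.
n_total = total_observations(58, 12, 4)
share_beyond_2 = residual_share(116, n_total)  # residuals outside +/-2
share_beyond_3 = residual_share(28, n_total)   # residuals outside +/-3

print(n_total)                   # 2784
print(round(share_beyond_2, 2))  # 4.17
print(round(share_beyond_3, 2))  # 1.01
```

With the reported counts, the share outside ±2 is about 4.17%, below the 5% guideline, and the share outside ±3 is about 1.01%, which sits at roughly the 1% guideline.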

Keywords:

Peer assessment, Self-assessment, Rater bias, New approaches, Many-Facet Rasch Model

Evaluation of University Students’ Rating Behaviors in Self and Peer Rating Process via Many Facet Rasch Model

Purpose: When self and peer assessment methods are used in the teaching process, the most important problem is the reliability of the ratings obtained from these sources. Increasing rater reliability is of great importance for the reliability of measurement in performance evaluation. This study aimed to determine the rater behaviors that university students display in the process of self and peer assessment. The research was based on a descriptive model. The participants were 58 students in the Guidance and Psychological Counseling Program at a foundation university in Ankara in the 2017-2018 academic year.

Findings: Many-Facet Rasch Model (MFRM) analysis was applied. No statistically significant difference in raters' severity and leniency was observed by gender, but there was a statistically significant difference by rater type (self and peer): raters were more lenient in self-assessments. The study also showed that while raters displayed central tendency at the individual level, they did not show it at the group level. It was concluded that individuals' ratings are more biased than group ratings when they evaluate group performance.

Implications for Research and Practice: Some raters' rating behaviors differed depending on the group being rated. The teacher candidates made systematic errors in the performance evaluation process and displayed behaviors that negatively affected the validity of the ratings. It is important to conduct studies aimed at reducing raters' scoring bias.

Keywords:

Peer assessment, self-assessment, rater bias, alternative assessment, Many-Facet Rasch Model

___

Akin, O. & Basturk, R. (2012). Keman egitiminde temel becerilerin Rasch olcme modeli ile degerlendirilmesi [The evaluation of the basic skills in violin training by many facet Rasch model]. Pamukkale University Journal of Education, 31(1), 175-187. Retrieved from https://dergipark.org.tr/pauefd/issue/11112/132860
Andrade, H. G. (2005). Teaching with Rubrics: The Good, the Bad, and the Ugly. College Teaching, 53(1), 27-31. https://doi.org/10.3200/CTCH.53.1.27-31
Andrade, H., Du, Y. & Mycek, K. (2010). Rubric‐referenced self‐assessment and middle school students' writing. Assessment in Education: Principles, Policy & Practice, 17(2), 199-214. https://doi.org/10.1080/09695941003696172
Baird, J. A., Hayes, M., Johnson, R., Johnson, S., & Lamprianou, I. (2013). Marker effects and examination reliability: A comparative exploration from the perspectives of generalisability theory, Rasch model and multilevel modelling. Oxford: University of Oxford for Educational Assessment. Retrieved from http://dera.ioe.ac.uk/17683/1/2013-01-21-marker-effects-and-examination-reliability.pdf
Ballantyne, R., Hughes, K., & Mylonas, A. (2002). Developing procedures for implementing peer assessment in large classes using an action research process. Assessment and Evaluation in Higher Education, 27, 427-441. https://doi.org/10.1080/0260293022000009302
Basturk, S. (2008). Ogretmenlik uygulamasi dersinin uygulama ogretmenlerinin goruslerine dayali olarak degerlendirilmesi [Evaluation of teaching practicum course based on the mentors’ opinions]. Educational Sciences and Practice, 7(14), 93-110. Retrieved from http://ebuline.com/turkce/arsiv/14_7.aspx
Bushell, G. (2006). Moderation of peer assessment in group projects. Assessment and Evaluation in Higher Education, 31, 91-108. https://doi.org/10.1080/02602930500262395
Cakici-Eser, D. & Gelbal, S. (2013). Genellenebilirlik kurami ve lojistik regresyona dayali hesaplanan puanlayicilar arasi tutarligin karsilastirilmasi [Comparison of interrater agreement calculated with generalizability theory and logistic regression]. Kastamonu Education Journal, 21(2), 423-438. Retrieved from http://www.kefdergi.com/pdf/21_2/21_2_2.pdf
Cetin, B., & Ilhan, M. (2017). An Analysis of Rater Severity and Leniency in Open-ended Mathematic Questions Rated Through Standard Rubrics and Rubrics Based on the SOLO Taxonomy. Education and Science, 42(189), 217-247. https://doi.org/10.15390/EB.2017.5082
Dochy, F., & McDowell, L. (1997). Assessment as a tool for learning. Studies in Educational Evaluation, 23, 279-298. https://doi.org/10.1016/S0191-491X(97)86211-6
Donnon, T., McIlwrick, J. & Woloschuk, W. (2013). Investigating the reliability and validity of self and peer assessment to measure medical students' professional competencies. Creative Education, 4(6A), 23-28. https://doi.org/10.4236/ce.2013.46A005
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112. https://doi.org/10.1111/j.1745-3984.1994.tb00436.x
Engelhard, G., & Stone, G.E. (1998). Evaluating the quality of ratings obtained from standard-setting judges. Educational and Psychological Measurement, 58(2), 179-196. https://doi.org/10.1177/0013164498058002003
Falchikov, N. & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287-322. https://doi.org/10.3102/00346543070003287
Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis. Review of Educational Research, 59, 395-430. https://doi.org/10.3102/00346543059004395
Farrokhi, F., & Esfandiari, R. (2011). A many-facet Rasch model to detect halo effect in three types of raters. Theory & Practice in Language Studies, 1(11), 1531-1540. https://doi.org/10.4304/tpls.1.11.1531-1540
Farrokhi, F., Esfandiari, R. & Dalili, M. V. (2011). Applying the Many-Facet Rasch Model to detect centrality in self-assessment, peer-assessment and teacher assessment. World Applied Sciences Journal, 15(Innovation and Pedagogy for Lifelong Learning), 70-77. Retrieved from https://pdfs.semanticscholar.org/dd21/ba5683dde8b616374876b0c53da376c10ca9.pdf
Farrokhi, F., Esfandiari, R. & Schaefer, E. (2012). A Many-Facet Rasch Measurement of differential rater severity/leniency in self assessment, peer assessment, and teacher assessment. Journal of Basic and Applied Scientific Research, 2(9), 8786-8798. Retrieved from https://jalt-publications.org/files/pdf-article/jj2012a-art4.pdf
Guler, N. (2008). Klasik test kurami, genellenebilirlik kurami ve Rasch modeli uzerine bir arastirma [A research on classical test theory, generalizability theory and Rasch model]. Unpublished thesis, Hacettepe Universitesi, Ankara.
Hauenstein, N. M. A. & McCusker, M. E. (2017). Rater training: Understanding effects of training content, practice ratings, and feedback. International Journal of Selection and Assessment, 25, 253-266. https://doi.org/10.1111/ijsa.12177
Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130-144. https://doi.org/10.1016/j.edurev.2007.05.002
Karakaya, I. (2015). Comparison of self, peer and instructor assessments in the portfolio assessment by using many facet RASCH model. Journal of Education and Human Development, 4(2), 182-192. https://doi.org/10.15640/jehd.v4n2a22
Kim, Y., Park, I., & Kang, M. (2012). Examining rater effects of the TGMD-2 on children with intellectual disability. Adapted Physical Activity Quarterly, 29(4), 346-365. https://doi.org/10.1123/apaq.29.4.346
Kubiszyn, T., & Borich, G. (2013). Educational testing and measurement: Classroom application and practice. Hoboken, NJ: John Wiley & Sons, Inc.
Kutlu, O., Yildirim, O. & Bilican, S. (2009). Ogretmenlerin dereceli puanlama anahtarlarina iliskin tutum olcegi gelistirme calismasi [Study of attitudes scale development aimed at scoring rubrics for primary school teachers]. Journal of Yuzuncu Yil University Faculty of Education, 6(2), 76-88. Retrieved from https://dergipark.org.tr/yyuefd/issue/13712/166014
Kwan, K., & Leung, R. (1996). Tutor versus peer group assessment of student performance in a simulation training exercise. Assessment and Evaluation in Higher Education, 21, 205-215. https://doi.org/10.1080/0260293960210301
Landry, A., Shoshanah, J. & Newton, G. (2015). Effective use of peer assessment in a graduate level writing assignment: A case study. International Journal of Higher Education, 4(1), 38-41. https://doi.org/10.5430/ijhe.v4n1p38
Lejk, M. & Wyvill, M. (2001). The effect of the inclusion of self-assessment with peer-assessment of contributions to a group project. Assessment and Evaluation in Higher Education, 26(6), 551-561. https://doi.org/10.1080/02602930120093887
Linacre, J. M. (1996). Generalizability theory and many-facet Rasch measurement. Objective measurement: Theory into practice, 3, 85-98. Retrieved from https://files.eric.ed.gov/fulltext/ED364573.pdf
Linacre, J. M. (2017). A user's guide to FACETS: Rasch-model computer programs. Chicago: MESA Press.
Lumley, T. & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. https://doi.org/10.1177/026553229501200104
Lunz, M. E., Wright, B. D. & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345. https://doi.org/10.1207/s15324818ame0304_3
McDonald, M. B. (1999). Seed deterioration: Physiology, repair and assessment. Seed Science and Technology, 27(1), 177-237. Retrieved from https://ci.nii.ac.jp/naid/10025267238/
McNamara, T. F., & Adams, R. J. (1991). Exploring rater behavior with Rasch techniques. Language Testing Research Colloquium, 1-29. Retrieved from https://files.eric.ed.gov/fulltext/ED345498.pdf
Millar, J. (2003). Gender, poverty and social exclusion. Social Policy and Society, 2(3), 181-188. https://doi.org/10.1017/S1474746403001246
Moore, B.B. (2009). Consideration of rater effects and rater design via signal detection theory. Unpublished Doctoral Dissertation. Columbia University, New York.
Mulqueen, C., Baker, D., & Dismukes, R. K. (2000). Using multi facet Rasch analysis to examine the effectiveness of rater training. 15th Annual Conference for the Society for Industrial and Organizational Psychology. https://doi.org/10.1037/e540522012-001
Myford, C. M., & Wolfe, E.W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422. Retrieved from http://jampress.org/
Myford, C.M., & Wolfe, E.W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5, 189-227. Retrieved from http://jampress.org/
Oosterhof, A. (2003). Developing and using classroom assessment. USA: Merrill/Prentice Hall.
Osburn, H. G. (2000). Coefficient alpha and related internal consistency reliability coefficients. Psychological Methods, 5(3), 343. https://doi.org/10.1037/1082-989X.5.3.343
Porter, D., & Shen, S. (1991). Sex, status and style in the interview. The Dolphin, 21(2), 117-128. https://doi.org/10.1002/pssa.2211280113
Puhl, C. A. (1997). Develop, not judge: Continuous assessment in the ESL classroom. Forum Magazine, 35(2), 2-9. Retrieved from https://eric.ed.gov/?id=EJ593288
Saal, F. E., Downey, R. G. & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413-428. https://doi.org/10.1037/0033-2909.88.2.413
Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465-493. https://doi.org/10.1177/0265532208094273
Sudweeks, R. R., Reeve, S. & Bradshaw, W. S. (2004). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9(3), 239-261. https://doi.org/10.1016/j.asw.2004.11.001
Temizkan, M. (2009). Akran degerlendirmenin konusma becerisinin gelistirilmesi uzerindeki etkisi [The effect of peer assessment on the development of speaking skill]. Mustafa Kemal University Journal of Social Sciences Institute, 6(12), 90-98. Retrieved from http://sbed.mku.edu.tr/article/view/1038000386
Topping, K. (1998). Peer assessment between students in colleges and universities. Review of Educational Research, 68(3), 249-276. https://doi.org/10.3102/00346543068003249
Topping, K. (2003). Self and peer assessment in school and university: Reliability, validity and utility. In Optimising new modes of assessment: In search of qualities and standards (Innovation and Change in Professional Education, Vol. 1, pp. 55-87). https://doi.org/10.1007/0-306-48125-1_4
Topping, K. J., Smith, E. F., Swanson, I. & Elliot, A. (2000). Formative peer assessment of academic writing between postgraduate students. Assessment & Evaluation in Higher Education, 25(2), 149-169. https://doi.org/10.1080/713611428
Unal, G., & Ergin, O. (2006). Bulus yoluyla fen ogretiminin ogrencilerin akademik basarilarina, ogrenme yaklasimlarina ve tutumlarina etkisi [Academic of students in science teaching through invention effect on successes, learning approaches and attitudes]. Journal of Turkish Science Education, 3(1), 36-52. Retrieved from http://www.tused.org/internet/tufed/arsiv/v3/i1/metin/tufedv3i1s3.pdf
Van-Trieste, R. F. (1990). The relation between Puerto Rican university students’ attitudes toward Americans and the students’ achievement in English as a second language. Homines, 13–14, 94–112.
Weigle, S. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287. https://doi.org/10.1177/026553229801500205
Weigle, S. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6(2), 145-178. https://doi.org/10.1016/S1075-2935(00)00010-6
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197-223. https://doi.org/10.1177/026553229401100206
Winke, P., Gass, S., & Myford, C. (2012). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231-252. https://doi.org/10.1177/0265532212456968