Score Dependability of a Speaking Test of Turkish as a Second Language: A Generalizability Study

Abstract: Test developers have to attend to all aspects of validity throughout test development and implementation. As one of the major aspects, scoring validity has to be established so that the scores assigned to a test performance are dependable. This study investigates the scoring validity of a speaking test of Turkish as a second language (TSL). For this purpose, six speaking tasks and a rating scale were developed; the tasks were administered to twenty-four L2 learners of Turkish, and their performances were evaluated by four raters using the scale. Score dependability was investigated through Generalizability (G) and Decision (D) analyses. The results indicated that most of the score variation could be attributed to the test takers rather than to error variance associated with the raters and tasks.
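As a point of reference for the design described above, the following is a minimal sketch of the variance decomposition assumed by a fully crossed person × task × rater (p × t × r) G-study, together with the D-study absolute-error variance and dependability coefficient (Φ), written in standard Generalizability-theory notation; the symbols are generic and do not reproduce the specific estimates reported in this study.

X_{ptr} = \mu + \nu_p + \nu_t + \nu_r + \nu_{pt} + \nu_{pr} + \nu_{tr} + \nu_{ptr,e}, \qquad \sigma^2(X_{ptr}) = \sigma^2_p + \sigma^2_t + \sigma^2_r + \sigma^2_{pt} + \sigma^2_{pr} + \sigma^2_{tr} + \sigma^2_{ptr,e}

\sigma^2_\Delta = \frac{\sigma^2_t}{n'_t} + \frac{\sigma^2_r}{n'_r} + \frac{\sigma^2_{pt}}{n'_t} + \frac{\sigma^2_{pr}}{n'_r} + \frac{\sigma^2_{tr}}{n'_t\, n'_r} + \frac{\sigma^2_{ptr,e}}{n'_t\, n'_r}, \qquad \Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}

In these terms, attributing most of the score variation to test takers means that \sigma^2_p dominates the error components in the denominator, which in turn yields acceptable Φ values even with modest numbers of tasks (n'_t) and raters (n'_r).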
