Investigation of Scale Transformation Methods in True Score Equating Based on Item Response Theory

This study compared the equating errors of scale transformation methods (mean-mean (MM), mean-sigma (MS), Stocking-Lord (SL), and Haebara (HB)) in true score equating based on Item Response Theory (IRT) under different conditions. To compare the errors of the methods, 7,200 dichotomous data sets consistent with the two- and three-parameter logistic models (2PLM and 3PLM) were generated with 50 replications under the conditions of sample size (500; 1,000; 3,000; 10,000), test length (40, 50, 80), common-item ratio (20%, 30%, 40%), parameter estimation model (2PLM and 3PLM), and group ability distributions (similar: N(0,1) and N(0,1); different: N(0,1) and N(0.5,1)). The common-item nonequivalent groups (NEAT) design was used as the data collection design. R software was used for data generation and analysis. Findings were evaluated using the equating error (RMSD) criterion. Across all conditions, the RMSD values of the SL method were higher than those of the other methods, while the MM and MS methods produced similar RMSD values. In addition, when the RMSD values of the scale transformation methods were compared, similar results were obtained with 2PLM and 3PLM; except for the SL method, the equating errors of the methods decreased as sample size and test length increased; and the methods yielded lower RMSD values when the common-item ratio was 40% and the group ability distributions were similar.

Investigation of Scale Transformation Methods in True Score Equating Based on Item Response Theory

This study aimed to compare the equating errors of scale transformation methods (mean-mean (MM), mean-sigma (MS), Haebara (HB), and Stocking-Lord (SL)) in true score equating based on item response theory (IRT) under different conditions. In line with the purpose of the study, 7,200 dichotomous data sets consistent with the two- and three-parameter logistic models were generated with 50 replications under the conditions of sample size (500; 1,000; 3,000; 10,000), test length (40, 50, 80), common-item ratio (20%, 30%, 40%), parameter estimation model (two- and three-parameter logistic models (2PLM and 3PLM)), and group ability distributions (similar (N(0,1) and N(0,1)) or different (N(0,1) and N(0.5,1))). The common-item nonequivalent groups (NEAT) equating design was used. R software was used for data generation and analyses. Results were evaluated using the equating error (RMSD) criterion. Considering all conditions, the RMSD values of the SL method were higher than those of the other methods, while the MM and MS methods produced similar RMSD values. In addition, when the RMSD values of the scale transformation methods were compared, similar results were obtained with 2PLM and 3PLM; the equating errors of the methods other than SL decreased as sample size and test length increased; and the methods had lower RMSD values when the common-item ratio was 40% and the group ability distributions were similar.
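All four scale transformation methods estimate the slope A and intercept B of the linear transformation between the two IRT ability scales (θ_old = A·θ_new + B) from the common items: MM and MS use moments of the item parameter estimates, while HB and SL minimize characteristic-curve criteria. The study carried out its analyses in R; the sketch below is a minimal Python illustration of the two moment methods and the RMSD criterion, with function names and inputs chosen for this example only.

```python
import numpy as np

def mean_sigma(b_new, b_old):
    """Mean-sigma coefficients for theta_old = A*theta_new + B,
    from common-item difficulty (b) estimates on each scale."""
    A = np.std(b_old, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_old) - A * np.mean(b_new)
    return A, B

def mean_mean(a_new, a_old, b_new, b_old):
    """Mean-mean coefficients: slope from mean discriminations
    (a_old = a_new / A), intercept from mean difficulties."""
    A = np.mean(a_new) / np.mean(a_old)
    B = np.mean(b_old) - A * np.mean(b_new)
    return A, B

def rmsd(estimated, criterion):
    """Root mean square difference between an estimated equating
    relationship and the criterion equating, over score points."""
    estimated = np.asarray(estimated, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    return float(np.sqrt(np.mean((estimated - criterion) ** 2)))
```

With error-free parameters generated under a known transformation, both moment methods recover A and B exactly; in the simulation, estimation error in a and b is what differentiates the methods' RMSD values.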

___

  • Aksekioğlu, B. (2017). Madde tepki kuramına dayalı test eşitleme yöntemlerinin karşılaştırılması: PISA 2012 Fen testi örneği (Tez Numarası: 454879) [Yüksek lisans tezi, Akdeniz Üniversitesi]. Yükseköğretim Kurulu Ulusal Tez Merkezi. https://tez.yok.gov.tr/UlusalTezMerkezi/
  • Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). American Council on Education.
  • Angoff, W. H. (1984). Scales, norms and equivalent scores. Educational Testing Service.
  • Arai, S., and Mayekawa, S. (2011). A comparison of equating methods and linking designs for developing an item pool under item response theory. Behaviormetrika, 38(1), 1-16. https://link.springer.com/article/10.2333/bhmk.38.1
  • Babcock, B., Albano, A., and Raymond, M. (2012). Nominal weights mean equating: A method for very small samples. Educational and Psychological Measurement, 72(4), 608–628. https://doi.org/10.1177/0013164411428609
  • Baker, F. B., and Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal of Educational Measurement, 28(2), 147-162. http://www.jstor.org/stable/1434796
  • Barnard, J. J. (1996). In search of equity in educational measurement: traditional versus modern equating methods. [Paper presentation]. ASEESA’s National Conference at the HSRC Conference Center 1996 Annual Meeting, Pretoria, South Africa.
  • Bastari, B. (2000). Linking multiple-choice and constructed-response items to a common proficiency scale [Doctoral dissertation, University of Massachusetts Amherst]. https://scholarworks.umass.edu/dissertations_1/5557
  • Brossman, B. G., and Lee. W-C. (2013). Observed score and true score equating procedures for multidimensional item response theory. Applied Psychological Measurement, 37(6), 460–481. https://doi.org/10.1177/0146621613484083
  • Budescu, D. (1985). Efficiency of linear equating as a function of the length of the anchor test. Journal of Educational Measurement, 22(1), 13-20. https://www.jstor.org/stable/1434562
  • Caldwell, L. J. (1984). A comparison of equating error in linear and Rasch model test equating method [Unpublished doctoral dissertation]. Florida State University.
  • Cao, Y. (2008). Mixed-format test equating: Effects of test dimensionality and common-item sets (Publication No. 3341415) [Doctoral dissertation, University of Maryland-Maryland]. ProQuest Dissertations and Theses Global.
  • Chen, H. W. (2001). Calibration of the ITBS test battery to the complete test battery: A comparison of five linking methods (Publication No. 3009576) [Doctoral dissertation, University of Iowa]. ProQuest Dissertations and Theses Global.
  • Cho, Y. (2007). Comparison of bootstrap standard errors of equating using IRT and equipercentile methods with polytomously-scored items under the common-item-nonequivalent-group design (Publication No. 3301690) [Doctoral dissertation, University of Iowa]. ProQuest Dissertations and Theses Global.
  • Chon, K. H., Lee, W. C., and Ansley, T. N. (2007). Assessing IRT model-data fit for mixed format tests (No.26). Center for Advanced Studies in Measurement and Assessment. https://www.semanticscholar.org/paper/Number-26-Assessing-IRT-Model-Data-Fit-for-Mixed-%E2%88%97-Chon-Lee/49c57e474a54beed3010ab0f2af64985ce6ddb50
  • Chu, K-L. (2002). Equivalent group test equating with the presence of differential item functioning (Publication No. 3065477) [Doctoral dissertation, The Florida State University]. ProQuest Dissertations and Theses Global.
  • Cook, L. L., and Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10(3), 37-45. https://eric.ed.gov/?id=EJ436860
  • Crocker, L., and Algina, J. (1986). Introduction to classical and modern test theory. Harcourt Brace Jovanovich College.
  • Cui, Z., and Kolen, M. J. (2008). Comparison of parametric and nonparametric bootstrap methods for estimating random error in equipercentile equating. Applied Psychological Measurement, 32(4), 334-347. https://doi.org/10.1177/0146621607300854
  • Dorans, N. J. (1990). Equating methods and sampling designs. Applied Measurement in Education, 3(1), 3-17. https://doi.org/10.1207/s15324818ame0301_2
  • Dorans, N. J., and Holland P. W. (2000). Population invariance and the equatability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37(4), 281-306. https://doi.org/10.1111/j.1745-3984.2000.tb01088.x
  • Dorans, N. J., Moses, T. P., and Eignor, D. R. (2010). Principles and practices of test score equating (No.41). Educational Testing Service. https://www.ets.org/research/policy_research_reports/publications/report/2010/ilrs
  • Drasgow, F., Levine, M. V., Tsien, S., Williams, B., and Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19(2), 143-165. https://doi.org/10.1177/014662169501900203
  • Eid, G. K. (2005). The effects of sample size on the equating of test items. Education, 126(1), 165-180. https://www.thefreelibrary.com/The+effects+of+sample+size+on+the+equating+of+test+items.-a0136846803
  • Felan, G. D. (2002, February 14-16). Test equating: Mean, linear, equipercentile and item response theory. [Paper presentation]. Southwest Educational Research Association 2002 Annual Meeting, Austin, TX, United States.
  • Fraenkel, J. R., Wallen, N. E., and Hyun, H. H. (2012). How to design and evaluate research in education. McGraw-Hill.
  • Godfrey, K. E. (2007). A comparison of Kernel equating and IRT true score equating methods (Publication No. 3273329) [Doctoral dissertation, The University of North Carolina-Chapel Hill]. ProQuest Dissertations and Theses Global.
  • González, J. (2014). SNSequate: Standard and nonstandard statistical models and methods for test equating. Journal of Statistical Software, 59(7), 1-30. https://www.jstatsoft.org/index
  • Gök, B., and Kelecioğlu, H. (2014). Denk olmayan gruplarda ortak madde deseni kullanılarak madde tepki kuramına dayalı eşitleme yöntemlerinin karşılaştırılması. Mersin Üniversitesi Eğitim Fakültesi Dergisi, 10(1), 120-136. https://dergipark.org.tr/tr/pub/mersinefd/issue/17393/181786
  • Gül, E., Doğan-Gül, Ç., Çokluk-Bökeoğlu, Ö., and Özkan, M. (2017). Temel eğitimden ortaöğretime geçiş matematik alt testi asıl sınav ve mazeret sınavlarının madde tepki kuramına göre eşitlenmesi. Abant İzzet Baysal Üniversitesi Eğitim Fakültesi Dergisi, 17(4), 1900-1915. https://dergipark.org.tr/tr/pub/aibuefd/issue/32772/363973
  • Gündüz, T. (2015). Test eşitlemede Madde Tepki Kuramına dayalı yetenek parametresine yönelik ölçek dönüştürme yöntemlerinin karşılaştırılması (Tez Numarası: 429524) [Yüksek lisans tezi, Gazi Üniversitesi]. Yükseköğretim Kurulu Ulusal Tez Merkezi. https://tez.yok.gov.tr/UlusalTezMerkezi/
  • Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144-149. https://doi.org/10.4992/psycholres1954.22.144
  • Hagge, S. L. (2010). The impact of equating method and format representation of common items on the adequacy of mixed-format test equating using nonequivalent groups (Publication No. 3422144) [Doctoral dissertation, University of Iowa, Iowa]. ProQuest Dissertations and Theses Global.
  • Hambleton, R. K., Swaminathan, H., and Rogers, H. J. (1991). Fundamentals of item response theory. Sage.
  • Han, K. T. (2008). Impact of item parameter drift on test equating and proficiency estimates. (Publication No. 3325324) [Doctoral dissertation, University of Massachusetts, Amherst]. ProQuest Dissertations and Theses Global.
  • Han, T., Kolen, M. J., and Pohlmann, J. (1997). A comparison among IRT true- and observed-score equating and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121. https://doi.org/10.1207/s15324818ame1002_1
  • Hanson, B. A., and Béguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26(1), 3-24. https://doi.org/10.1177/0146621602026001001
  • Harris, D. J., and Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6(3), 195–240. https://doi.org/10.1207/s15324818ame0603_3
  • Harris, D. J., and Kolen, M. J. (1986). Effect of examinee group on equating relationships. Applied Psychological Measurement, 10(1), 35-43. https://doi.org/10.1177/014662168601000103
  • Harwell, M., Stone, C. A., Hsu, T. C., and Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101-125. https://doi.org/10.1177/014662169602000201
  • He, Q. (2010). Maintaining standards in on-demand testing using item response theory (No.10/4724). Office of Qualifications and Examinations Regulation. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/att achment_data/file/605861/0210_QingpingHe_Maintaining-standards.pdf
  • He, Y. (2011). Evaluating equating properties for mixed-format tests (Publication No. 3461151) [Doctoral dissertation, University of Iowa, Iowa City]. ProQuest Dissertations and Theses Global.
  • Hills, J. R., Subhiyah, R. G., and Hirsch, T. M. (1988). Equating minimum-competency tests: Comparisons of methods. Journal of Educational Measurement, 25(3), 221-231. https://www.jstor.org/stable/1434501
  • Holland, P. W., and Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187-220). Praeger.
  • Holland, P. W., Dorans, N. J., and Petersen, N. S. (2007). Equating test scores. In C. R. Rao and S. Sinharay (Eds.), Handbook of statistics (pp. 169-203). Elsevier. https://doi.org/10.1016/S0169-7161(06)26006-1
  • Hu, H., Rogers, T. W., and Vukmirovic, Z. (2008). Investigation of IRT-based equating methods in the presence of outlier common items. Applied Psychological Measurement, 32(4), 311-333. https://doi.org/10.1177/0146621606292215
  • Ironson, G. H. (1983). Using item response theory to measure bias. In R. K. Hambleton (Ed.), Applications of item response theory (2nd ed., pp. 155-174). Educational Research Institute of British Columbia.
  • Kang, T., and Petersen, N. S. (2009, April 14-16). Linking item parameters to a base scale [Paper presentation]. National Council on Measurement in Education 2009 Annual Meeting, San Diego, CA, United States.
  • Karkee, T. B., and Wright, K. R. (2004, April 12-16). Evaluation of linking methods for placing three parameter logistic item parameter estimates onto a one-parameter scale [Paper presentation]. American Educational Research Association 2004 Annual Meeting, San Diego, CA, United States.
  • Kaskowitz, G. S., and De Ayala, R. J. (2001). The effect of error in item parameter estimates on the test response function method of linking. Applied Psychological Measurement, 25(1), 39-52. https://doi.org/10.1177/01466216010251003
  • Kilmen, S. (2010). Madde tepki kuramına dayalı test eşitleme yöntemlerinden kestirilen eşitleme hatalarının örneklem büyüklüğü ve yetenek dağılımına göre karşılaştırılması (Tez Numarası: 279926) [Yüksek lisans tezi, Ankara Üniversitesi]. Yükseköğretim Kurulu Ulusal Tez Merkezi. https://tez.yok.gov.tr/UlusalTezMerkezi/
  • Kim, S., and Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22(2), 131-143. https://doi.org/10.1177/01466216980222003
  • Kim, S., and Hanson, B. A. (2002). Test equating under the multiple-choice model. Applied Psychological Measurement, 26(3), 255-270. https://doi.org/10.1177/0146621602026003002
  • Kim, S., and Kolen, M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19(4), 357-381. https://doi.org/10.1207/s15324818ame1904_7
  • Kim, S., and Lee, W. C. (2004). IRT scale linking methods for mixed-format tests (No.5). American College Testing. https://www.act.org/content/dam/act/unsecured/documents/ACT_RR2004-5.pdf
  • Kim, S., and Lee, W. (2006). An extension of four IRT linking methods for mixed-format tests. Journal of Educational Measurement, 43(1), 53-76. https://www.jstor.org/stable/20461809
  • Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18(1), 1-11. https://www.jstor.org/stable/1434813
  • Kolen, M. J. (1985). Standard errors of Tucker equating. Applied Psychological Measurement, 9(2), 209-223. https://doi.org/10.1177/014662168500900209
  • Kolen, M. J. (1988). An NCME instructional module on traditional equating methodology. Educational Measurement: Issues and Practice, 7, 29-36. https://eric.ed.gov/?id=EJ388096
  • Kolen, M. J. (2007). Data collection designs and linking procedures. In N. J. Dorans, M. Pommerich, and P. W. Holland (Eds.), Linking and aligning scores and scales (2nd ed., pp. 31-55). Springer. https://doi.org/10.1007/978-0-387-49771-6_3
  • Kolen, M. J., and Brennan, R. L. (1995). Test equating: Methods and practices. Springer-Verlag.
  • Kolen, M. J., and Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices. Springer.
  • Lee, G., and Fitzpatrick, A. R. (2008). A new approach to test score equating using item response theory with fixed c-parameters. Asia Pacific Education Review, 9(3), 248-261. https://www.springer.com/journal/12564
  • Lee, W. C., and Ban, J. C. (2009). Comparison of IRT linking procedures. Applied Measurement in Education, 23(1), 23-48. https://doi.org/10.1080/08957340903423537
  • Lee, Y. S. (2007). A comparison of methods for nonparametric estimation of item characteristic curves for binary items. Applied Psychological Measurement, 31(2), 121-134. https://doi.org/10.1177/0146621606290248
  • Li, D. (2009). Developing a common scale for testlet model parameter estimates under the common-item nonequivalent groups design (Publication No. 3359398) [Doctoral dissertation, University of Maryland, Maryland]. ProQuest Dissertations and Theses Global.
  • Li, Y. H., and Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24 (2), 115-138. https://doi.org/10.1177/01466216000242002
  • Liou, M., Cheng, P. E., and Johnson, E. G. (1997). Standard errors of the Kernel equating methods under the common-item design. Applied Psychological Measurement, 21 (4), 349-369. https://doi.org/10.1177/01466216970214005
  • Livingston, S. A., and Kim, S. (2010). Random-Groups equating with samples of 50 to 400 test takers. Journal of Educational Measurement, 47(2), 175–185. https://www.jstor.org/stable/20778946
  • Lord, F. M. (1983). Statistical bias in maximum likelihood estimators of item parameters. Psychometrika, 48(3), 477-482. https://doi.org/10.1007/BF02293684
  • Lord, F. M., and Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score equatings. Applied Psychological Measurement, 8(4), 453-461. https://doi.org/10.1177/014662168400800409
  • Loyd, B. H., and Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179-193. https://www.jstor.org/stable/1434833
  • Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14(2), 139-160. http://www.jstor.org/stable/1434012
  • Meng, Y. (2012). Comparison of Kernel equating and item response theory equating methods (Publication No. 3518262) [Doctoral dissertation, University of Massachusetts, Amherst]. ProQuest Dissertations and Theses Global.
  • Michaelides, M. P. (2003, April 21-25). Sensitivity of IRT equating to the behavior of test equating items [Paper presentation]. American Educational Research Association 2003 Annual Meeting, Chicago, Illinois, United States.
  • Mohandas, R. (1996). Test equating, problems and solutions: Equating English test forms for the Indonesian junior secondary school final examination administered in 1994 [Doctoral dissertation, Flinders University]. https://flinders-primo.hosted.exlibrisgroup.com/primo-explore/search?vid=FUL&lang=en_US
  • Ngudgratoke, S. (2009). An investigation of using collateral information to reduce equating biases of the post-stratification equating method (Publication No. 3381312) [Doctoral dissertation, Michigan State University-Michigan]. ProQuest Dissertations and Theses Global.
  • Norman-Dvorak, R. K. (2009). A comparison of Kernel equating to the test characteristic curve methods (Publication No. 3350452) [Doctoral dissertation, University of Nebraska-Lincoln]. ProQuest Dissertations and Theses Global.
  • Nozawa, Y. (2008). Comparison of parametric and nonparametric IRT equating methods under the common-item nonequivalent groups design (Publication No. 3347237) [Doctoral dissertation, The University of Iowa-Iowa City]. ProQuest Dissertations and Theses Global.
  • Ogasawara, H. (2000). Asymptotic standard errors of IRT equating coefficients using moments. Economic Review (Otaru University of Commerce), 51(1), 1-23. https://www.researchgate.net/publication/241025868
  • Partchev, I. (2016). Package "irtoys" (Version 0.2.0). https://cran.r-project.org/web/packages/irtoys/irtoys.pdf
  • Petersen, N. S., Cook, L. L., and Stocking, M. L. (1983). IRT versus conventional equating methods: a comparative study of scale stability. Journal of Educational Statistics, 8(2), 137-156. https://www.jstor.org/stable/1164922
  • Petersen, N. S., Kolen, M. J., and Hoover, H. D. (1993). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (2nd ed., pp. 221-262). The Oryx Press.
  • Rizopoulos, D. (2015). Package "ltm". https://cran.r-project.org/web/packages/ltm/ltm.pdf
  • Ryan, J., and Brockmann, F. (2009). A practitioner's introduction to equating. https://files.eric.ed.gov/fulltext/ED544690.pdf
  • Sarkar, D. (2017). Package "lattice". https://cran.r-project.org/web/packages/lattice/lattice.pdf
  • Skaggs, G. (1990). To match or not to match samples on ability for equating: A discussion of five articles. Applied Measurement in Education, 3(1), 105-113. https://doi.org/10.1207/s15324818ame0301_8
  • Skaggs, G. (2005). Accuracy of random groups equating with very small samples. Journal of Educational Measurement, 42(2),309–330. https://doi.org/10.1111/j.1745-3984.2005.00018.x
  • Skaggs, G., and Lissitz, R. W. (1986). IRT test equating: Relevant issues and a review of recent research. Review of Educational Research, 56(4), 495-529. https://doi.org/10.3102/00346543056004495
  • Spence, P. D. (1996). The effect of multidimensionality on unidimensional equating with item response theory (Publication No. 9703612) [Doctoral dissertation, University of Florida]. ProQuest Dissertations and Theses Global.
  • Speron, E. (2009). A comparison of metric linking procedures in item response theory (Publication No. 3370885) [Doctoral dissertation, University of Illinois]. ProQuest Dissertations and Theses Global.
  • Stocking, M. L., and Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201-210. https://doi.org/10.1177/014662168300700208
  • Qu, Y. (2007). The effect of weighting in Kernel equating using counter-balanced designs (Publication No. 3282191) [Doctoral dissertation, Michigan State University]. ProQuest Dissertations and Theses Global.
  • Sinharay, S., and Holland, P. W. (2008). The missing data assumptions of the nonequivalent groups with anchor test (NEAT) design and their implications for test equating (No.09-16). Educational Testing Service. https://files.eric.ed.gov/fulltext/ED507841.pdf
  • Tate, R. (2000). Performance of a proposed method for the linking of mixed-format tests with constructed response and multiple choice items. Journal of Educational Measurement, 37(4), 329-346. http://www.jstor.org/stable/1435244
  • Tsai, T. H. (1997, March 24-27). Estimating minimum sample sizes in random groups equating [Paper presentation]. National Council on Measurement in Education 1997 Annual Meeting, Chicago, Illinois, United States.
  • Tsai, T. H., Hanson, B. A., Kolen, M. J., and Forsyth, R. A. (2001). A comparison of bootstrap standard errors of IRT equating methods for the common-item nonequivalent groups design. Applied Measurement in Education, 14(1), 17-30. https://doi.org/10.1207/S15324818AME1401_03
  • Uysal, İ. (2014). Madde Tepki Kuramı’na dayalı test eşitleme yöntemlerinin karma modeller üzerinde karşılaştırılması (Tez Numarası: 370226) [Yüksek lisans tezi, Abant İzzet Baysal Üniversitesi]. Yükseköğretim Kurulu Ulusal Tez Merkezi. https://tez.yok.gov.tr/UlusalTezMerkezi/
  • von Davier, A. A. (2008). New results on the linear equating methods for the nonequivalent groups design. Journal of Educational and Behavioral Statistics, 33(2), 186-203. https://www.jstor.org/stable/20172112
  • von Davier, A. A. (2010). Statistical models for test equating, scaling and linking. Springer.
  • von Davier, A. A., and Wilson, C. (2007). IRT true-score test equating: A guide through assumptions and applications. Educational and Psychological Measurement, 67(6), 940-957. https://doi.org/10.1177/0013164407301543
  • Walker, M. E., and Kim, S. (2010, April). Linking mixed-format tests using multiple choice anchors [Paper presentation]. National Council on Measurement in Education 2010 Annual Meeting, San Diego, CA, United States.
  • Wang, X. (2012). Effect of sample size on IRT equating of unidimensional tests in common item non-equivalent group design: A Monte Carlo simulation study [Doctoral dissertation, Virginia Polytechnic Institute and State University]. https://vtechworks.lib.vt.edu/handle/10919/37555
  • Way, W. D., and Tang, K. L. (1991, April 3-7). A comparison of four logistic model equating methods [Paper presentation]. American Educational Research Association 1991 Annual Meeting, Chicago, Illinois, United States.
  • Weeks, J. P. (2010). Plink: An R package for linking mixed-format tests using IRTbased methods. Journal of Statistical Software, 35(12), 1-33. https://www.jstatsoft.org/article/view/v035i12
  • Wu, N., Huang, C-Y., Huh, N., and Harris, D. (2009, April 12-13). Robustness in using multiple choice items as an external anchor for constructed-response test equating [Paper presentation]. National Council on Measurement in Education 2009 Annual Meeting, San Diego, CA, United States.
  • Yang, W. L. (1997). The effects of content homogeneity and equating method on the accuracy of common item test equating (Publication No. 9839718) [Doctoral dissertation, Michigan State University-Michigan]. ProQuest Dissertations and Theses Global.
  • Yang, W. L., and Houang, R. T. (1996, April, 11-13). The effect of anchor length and equating method on the accuracy of test equating: comparisons of linear and IRT-based equating using an anchor-item design. [Paper presentation]. American Educational Research Association 1996 Annual Meeting, New York City, New York, United States.
  • Zeng, L. (1991). Standard errors of linear equating for the single-group design. (No.91-4). American College Testing. https://www.act.org/content/dam/act/unsecured/documents/ACT_RR91-04.pdf
  • Zhao, Y. (2008). Approaches for addressing the fit of item response theory models to educational test data (Publication No. 3337019) [Doctoral dissertation, University of Massachusetts Amherst]. ProQuest Dissertations and Theses Global.