Test score equating and item anchoring for high stakes examination

Chieng Zouh Fong* lalabai1108@gmail.com University of Malaya
Tong Yeah Chuen tongyc@yahoo.com University of Malaya
Summary: 
Test score equating is essential to safeguard test fairness when students sit for the actual national examination. This paper therefore describes a proposal to help teachers in Malaysia ascertain the relative efficiency of test score equating methods for comparing students' performance on high-stakes examinations. The proposal addresses the practical implications of score equating by describing aspects of the equating and item anchoring process that teachers can apply. The study examined the Principles of Accounting (PA) subject using the Rasch measurement framework for dichotomous data analysis. A non-experimental quantitative research approach was adopted, in which a set of equivalent test instruments was administered to two different groups of respondents comprising 429 students. Data were collected through stratified random sampling and analysed using the Winsteps software. Under the Common Item Non-Equivalent Group (CINEG) design, also known as the Non-Equivalent Groups with Anchor Test (NEAT) design, the results showed good model fit, with both test forms fitting the measurement model reasonably well. No single student's destiny should rest on a single test paper (Wu et al., 2016). Hence, teachers in schools should develop multiple sets of equivalent test papers to the same standard as the actual examination papers, so that students are better prepared for the national examination and better able to achieve their desired grades.
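To make the equating mechanics concrete, the short sketch below illustrates how a CINEG/NEAT design can link two test forms under the dichotomous Rasch model: each form is calibrated separately, and the mean difference between the shared anchor items' difficulty estimates serves as the linking constant (the mean-mean linking method). This is an illustrative sketch, not the paper's actual analysis: all numeric values are hypothetical, and in practice the difficulty estimates would come from software such as Winsteps.

import numpy as np

# Dichotomous Rasch model: P(X=1 | theta, b) = exp(theta - b) / (1 + exp(theta - b)).
# In a CINEG/NEAT design, Form A and Form B share a set of anchor items; the
# difference in those items' difficulty estimates across the two separate
# calibrations gives the constant that places both forms on one scale.

# Hypothetical anchor-item difficulty estimates (in logits) from the two
# separate calibrations.
anchors_form_a = np.array([-0.8, -0.2, 0.3, 0.9, 1.4])
anchors_form_b = np.array([-0.5, 0.1, 0.6, 1.2, 1.7])

# Mean-mean linking constant: how far Form B's scale sits above Form A's.
shift = anchors_form_b.mean() - anchors_form_a.mean()

# Re-express Form B's item difficulties (hypothetical values) on Form A's
# scale; person measures from Form B would be shifted the same way.
form_b_items = np.array([-1.1, -0.5, 0.1, 0.6, 1.2, 1.7, 2.0])
form_b_on_a = form_b_items - shift

print(f"Linking constant (logits): {shift:.2f}")
print("Form B difficulties on Form A's scale:", np.round(form_b_on_a, 2))

Once the linking constant is applied, scores from either form can be reported on a common logit scale, which is what allows a teacher-made equivalent paper to be compared against the standard of the actual examination paper.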
Keywords: 
Test equating
item anchoring
Rasch model
test fairness
