Theses Doctoral

Investigating the combined effects of rater expertise, working memory capacity, and cognitive functionality on the scoring of second language speaking performance

Han, Qie

In L2 performance assessment, raters can significantly affect test validity due to rater variability, a source of construct-irrelevant variance in scores caused by differences in raters’ characteristics rather than test takers’ ability. To improve scoring validity, we must investigate what rater characteristics are likely to contribute to rater variability. The current study thus investigated the combined effects of three major rater characteristics, i.e., rater expertise, working memory capacity (WMC), and cognitive functionality, on raters’ scoring performance in L2 speaking assessment. Exploring these effects may increase our understanding of what rater-associated factors contribute to rater variability, thereby shedding light on rater selection, training, and scoring practices.

To this end, 90 raters from the US and the UK participated in two parts of the study. In Part I, the 90 raters completed a rater background survey designed to measure their L2 performance assessment-related experience, scored 27 responses from the Aptis speaking test, and completed one verbal working memory task. Hierarchical regression analyses were conducted to explore: 1) the relative contributions of rater expertise and WMC to scoring performance, and 2) any possible interaction between the two characteristics in their joint influences on scoring performance. Results from the analysis indicate that rater expertise had a significant effect on raters’ scoring accuracy. However, WMC was not found to significantly influence raters’ scoring performance. In addition, no significant interaction was found between rater expertise and WMC, which suggests independent influences of these two characteristics on scoring performance.
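As an illustration of the kind of analysis described above, the following is a minimal sketch of a hierarchical regression with an interaction term. It is not the study's actual analysis: the data are synthetic, the order in which predictors are entered is assumed, and the names `r_squared`, `hierarchical_r2`, `expertise`, `wmc`, and `accuracy` are hypothetical. Predictors are entered in steps, and the change in R² at each step indicates how much additional variance in scoring accuracy the new term explains.

```python
import numpy as np


def r_squared(X, y):
    """Return R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def hierarchical_r2(expertise, wmc, accuracy):
    """Fit three nested models and report R^2 and its change at each step.

    Assumed entry order (not taken from the study):
    expertise, then WMC, then the expertise x WMC interaction.
    """
    # Mean-center the predictors so the interaction term is easier to interpret.
    e = expertise - expertise.mean()
    w = wmc - wmc.mean()
    steps = {
        "expertise": np.column_stack([e]),
        "expertise + wmc": np.column_stack([e, w]),
        "expertise + wmc + interaction": np.column_stack([e, w, e * w]),
    }
    results = []
    previous = 0.0
    for label, X in steps.items():
        r2 = r_squared(X, accuracy)
        results.append((label, r2, r2 - previous))
        previous = r2
    return results


# Synthetic data for illustration only; these are NOT the study's data.
rng = np.random.default_rng(0)
n = 90  # same sample size as the study, but the values are simulated
expertise = rng.normal(size=n)
wmc = rng.normal(size=n)
accuracy = 0.5 * expertise + rng.normal(size=n)

for label, r2, delta in hierarchical_r2(expertise, wmc, accuracy):
    print(f"{label}: R^2 = {r2:.3f}, change = {delta:.3f}")
```

In a sketch like this, a negligible change in R² when the interaction term is added would be consistent with the two characteristics influencing scoring independently. A full analysis would also test each change for statistical significance.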

In Part II, six of the 90 raters were randomly selected to participate in a cognitive lab session, where they scored three Aptis spoken responses and verbally reported their thinking process during scoring. The raters’ reports were coded and analyzed based on a hypothesized taxonomy of rater strategies invoked in the L2 scoring process. Fourteen major strategies were identified from the raters’ verbal reports. Differences were also found between expert and novice raters in the quantity and quality of strategy use. These findings shed light on the mental mechanisms underlying raters’ scoring performance and link differences in raters’ strategy use to different levels of rater expertise.



More About This Work

Academic Units
Arts and Humanities
Thesis Advisors
Purpura, James Enos
Ed.D., Teachers College, Columbia University
Published Here
July 30, 2020