Theses Doctoral

Studies of Rater and Item Effects in Rater Models

Zhao, Yihan

The goal of educational testing is to measure psychological constructs in a particular domain and to support valid inferences about examinees’ ability. To evaluate ability precisely, test developers write questions in different formats, such as multiple-choice (MC) items and open-ended or constructed-response (CR) items, for example essay items. In recent years, large-scale assessments have adopted CR items alongside MC items as an essential component of the educational assessment landscape.

However, using CR items in testing involves two main challenges: rater effects and rater correlations. The first challenge is the error introduced by human raters’ subjective judgments, such as rater severity and rater central tendency. Rater severity refers to a rater’s tendency to give consistently low or high ratings, which biases ability estimates (Leckie & Baird, 2011). Central tendency refers to a rater’s tendency to use the middle categories of the scoring rubric and avoid the extreme ones (Saal et al., 1980). The second challenge is that multiple raters usually grade each examinee’s essay for quality control; ratings based on the same item are therefore correlated and must be handled with appropriate statistical procedures (Eckes, 2011; Kim, 2009).
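
The two rater effects described above can be illustrated with a small simulation. The sketch below is not the simulation design used in the thesis; it assumes a simple logistic SDT-style rater in which the rater's perception of an essay is its true category times a precision value plus logistic noise, and the rating is the number of response criteria that perception exceeds. The function names, the precision value, and the criteria locations are all hypothetical choices made for illustration.

```python
import math
import random


def simulate_ratings(true_categories, d, criteria, rng):
    """Rate each essay once with a logistic SDT-style rater (illustrative)."""
    ratings = []
    for eta in true_categories:
        # Logistic noise via the inverse CDF; clamp u away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        perception = d * eta + math.log(u / (1.0 - u))
        # The rating is the number of criteria the perception exceeds.
        ratings.append(sum(perception > c for c in criteria))
    return ratings


def summarize(ratings, top):
    """Return (mean rating, share of ratings in the two extreme categories)."""
    mean = sum(ratings) / len(ratings)
    extreme = sum(r in (0, top) for r in ratings) / len(ratings)
    return mean, extreme


rng = random.Random(0)
essays = [rng.randrange(4) for _ in range(5000)]  # true categories 0..3

D = 2.0  # hypothetical rater precision (assumed value)
raters = {
    "neutral": [1.0, 3.0, 5.0],   # criteria midway between category means
    "severe": [2.0, 4.0, 6.0],    # all criteria shifted up: lower scores
    "central": [-0.5, 3.0, 6.5],  # outer criteria pushed outward
}
results = {name: summarize(simulate_ratings(essays, D, crit, rng), top=3)
           for name, crit in raters.items()}

for name, (mean, extreme) in results.items():
    print(f"{name:8s} mean={mean:.2f} extreme share={extreme:.2f}")
```

Under these assumptions, the severe rater's mean rating should fall below the neutral rater's, and the central-tendency rater should use the lowest and highest categories less often, even though all three raters score the same essays.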

To address these problems, DeCarlo (2010) proposed the HRM-SDT model, which extended the traditional signal detection theory (SDT) model used in the first level of the hierarchical rater model (HRM). The HRM-SDT model not only accounts for the hierarchical structure of rating data but also handles rater effects beyond rater severity. This research examined to what extent the HRM-SDT separates rater effects (i.e., rater severity and rater central tendency) from item effects (i.e., item difficulty). Accordingly, one goal of this study was to simulate various rater and item effects and to investigate how well the HRM-SDT model separates them. The other goal was to compare the fit of the HRM-SDT model with that of a model commonly used in language assessment, the Rasch model, under different simulation conditions and to examine how the two models differ in separating rater and item effects.

To answer these questions, two simulation studies, Simulation A and Simulation B, were conducted. Simulation A varied seven sets of parameters. Simulation B addressed questions of particular interest using another four sets of parameters, in which the rater and item parameters were varied simultaneously. The study found that the HRM-SDT accurately recovered parameters and clearly detected and separated changes in rater severity, rater central tendency, and item difficulty in most conditions.


More About This Work

Academic Units: Measurement and Evaluation
Thesis Advisors: DeCarlo, Lawrence T.
Degree: Ph.D., Columbia University
Published Here: July 6, 2020