Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Measurement+and+Evaluation&q=&rows=500&sort=record_creation_date+desc
Exploring Skill Condensation Rules for Cognitive Diagnostic Models in a Bayesian Framework
http://academiccommons.columbia.edu/catalog/ac:193918
Luna Bazaldua, Diego A.
http://dx.doi.org/10.7916/D8NP247C
Wed, 27 Jan 2016 00:00:00 +0000
Diagnostic paradigms are becoming an alternative to normative approaches in educational assessment. One of the principal objectives of diagnostic assessment is to determine skill proficiency for tasks that demand the use of specific cognitive processes. Ideally, diagnostic assessments should include accurate information about the skills required to correctly answer each item in a test, as well as any additional evidence about the interaction between those cognitive constructs. Nevertheless, little research in the field has focused on the types of interactions (i.e., the condensation rules) among skills in models for cognitive diagnosis. The present study introduces a Bayesian approach to determine the underlying interaction among the skills measured by a given item by comparing models with conjunctive, disjunctive, and compensatory condensation rules. Following the reparameterization framework proposed by DeCarlo (2011), the present study includes transformations for disjunctive and compensatory models. Next, a methodology that compares pairs of models with different condensation rules is presented; the model parameters and their distributions were defined following earlier Bayesian approaches proposed in the literature. Simulation studies and empirical studies were performed to test the capacity of the model to correctly identify the underlying condensation rule. Overall, results from the simulation study showed that the condensation rule is correctly identified across conditions, and that correct identification depends on the item parameter values used to generate the data and on the use of informative prior distributions for the model parameters. Latent class size parameters for the skills and their respective hyperparameters also showed good recovery in the simulation study.
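The three condensation rules named in this abstract can be illustrated with a minimal sketch. It is not the dissertation's model: the function names are hypothetical, each item is assumed to require at least one skill, and the compensatory rule is assumed here to be the proportion of required skills mastered, which is only one of several possible definitions.

```python
from typing import Sequence


def conjunctive(alpha: Sequence[int], q: Sequence[int]) -> int:
    """1 only if every skill the item requires (q_k = 1) is mastered (alpha_k = 1)."""
    return int(all(a == 1 for a, need in zip(alpha, q) if need == 1))


def disjunctive(alpha: Sequence[int], q: Sequence[int]) -> int:
    """1 if at least one of the required skills is mastered."""
    return int(any(a == 1 for a, need in zip(alpha, q) if need == 1))


def compensatory(alpha: Sequence[int], q: Sequence[int]) -> float:
    """Assumed rule: proportion of the required skills that are mastered."""
    needed = sum(q)
    if needed == 0:
        raise ValueError("the item must require at least one skill")
    mastered = sum(a for a, need in zip(alpha, q) if need == 1)
    return mastered / needed


# Hypothetical examinee who has mastered skills 1 and 3,
# answering an item whose Q-matrix row requires skills 1 and 2.
alpha = [1, 0, 1]
q_row = [1, 1, 0]
print(conjunctive(alpha, q_row), disjunctive(alpha, q_row), compensatory(alpha, q_row))
```

Under these assumptions the same skill profile leads to a different ideal response depending on the rule, which is why identifying the rule for each item matters.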
The recovery of the item parameters presented limitations, so some guidelines to improve their estimation are presented in the results and discussion sections. The empirical studies highlighted the usefulness of this approach in determining the interaction among skills using real items from a mathematics test and a language test. Despite the differences in their area of knowledge and Q-matrix structure, results indicated that both tests contain a higher proportion of conjunctive items, which demand the mastery of all skills.
Educational tests and measurements, Quantitative psychology, Educational evaluation, Education--Mathematical models
dal2159
Measurement and Evaluation
Dissertations
Estimating the Q-matrix for Cognitive Diagnosis Models in a Bayesian Framework
http://academiccommons.columbia.edu/catalog/ac:176107
Chung, Meng-ta
http://dx.doi.org/10.7916/D857195B
Mon, 07 Jul 2014 00:00:00 +0000
This research aims to develop an MCMC algorithm for estimating the Q-matrix in a Bayesian framework. A saturated multinomial model was used to estimate correlated attributes in the DINA model and rRUM. Closed-form posteriors for the guess and slip parameters were derived for the DINA model. The random walk Metropolis-Hastings algorithm was applied to parameter estimation in the rRUM. An algorithm for reducing potential label switching was incorporated into the estimation procedure. A method for simulating data with correlated attributes for the DINA model and rRUM was offered. Three simulation studies were conducted to evaluate the algorithm for Bayesian estimation. Twenty simulated data sets for simulation study 1 were generated from independent attributes for the DINA model and rRUM. A hundred data sets from correlated attributes were generated for the DINA and rRUM with guess and slip parameters set to 0.2 in simulation study 2. Simulation study 3 analyzed data sets simulated from the DINA model with guess and slip parameters generated from Uniform(0.1, 0.4). Results from simulation studies showed that the Q-matrix recovery rate was satisfactory. Using the fraction-subtraction data, an empirical study was conducted for the DINA model and rRUM. The estimated Q-matrices from the two models were compared with the expert-designed Q-matrix.
Quantitative psychology and psychometrics, Statistics, Educational tests and measurements
Human Development, Measurement and Evaluation
Dissertations
Factors Affecting Probability Matching Behavior
http://academiccommons.columbia.edu/catalog/ac:164326
Gao, Jie
http://hdl.handle.net/10022/AC:P:21357
Fri, 16 Aug 2013 00:00:00 +0000
In life, people commonly face repeated decisions under risk or uncertainty. While normative economic models assume that people tend to make choices that maximize their expected utility, suboptimal behavior - in particular, probability matching - is frequently observed in research on repeated decisions. Probability matching is the tendency to match prediction probabilities of each outcome with the observed outcome probabilities in a random binary prediction task. For example, when people are faced with making a sequence of predictions, such as repeatedly predicting the outcome of rolling a die with four sides colored green and two sides colored red, most people allocate about two-thirds of their predictions to green, and one-third to red. The optimal strategy, referred to as maximizing, would be to choose the outcome with the higher probability in every trial in the prediction task. Various causes for probability matching have been proposed during the past several decades. Here it is proposed that implicit adoption of a perfect prediction goal by decision makers might tend to elicit probability matching behavior. Thus, one factor that might affect the prevalence of probability matching behavior (investigated in Studies 1 and 2) is the type of performance goal. The manipulation in Study 1 contrasted single-trial prediction with prediction of four-trial sequences, which it is hypothesized might create an implicit perfect prediction goal for the sequence. In Study 2, three levels of goal were explicitly manipulated for each sequence: a perfect prediction goal, an 80% correct goal, and a 60% correct goal. In both studies it was predicted that more matching behavior would be observed for those who have a goal of perfect prediction than those who have a more reasonable (lower) goal. The results of both studies, conducted in an online worker marketplace, supported the goal-level hypothesis.
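The die example in this abstract can be checked with a short calculation. The sketch below is illustrative only and assumes that each prediction is made independently of the outcome; the function name `expected_accuracy` is hypothetical and does not come from the dissertation.

```python
from fractions import Fraction


def expected_accuracy(p_outcome: Fraction, p_predict: Fraction) -> Fraction:
    """Probability that a single prediction is correct.

    p_outcome: probability that the die shows green.
    p_predict: probability that the person predicts green.
    Assumes the prediction is independent of the outcome.
    """
    return p_outcome * p_predict + (1 - p_outcome) * (1 - p_predict)


p_green = Fraction(4, 6)  # four green sides out of six

# Probability matching: predict green on two-thirds of the trials.
matching = expected_accuracy(p_green, p_green)

# Maximizing: predict green on every trial.
maximizing = expected_accuracy(p_green, Fraction(1))

print(matching, maximizing)
```

Under these assumptions matching is correct on 5/9 of trials, while maximizing is correct on 2/3 of trials, which is why matching counts as suboptimal.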
The second factor proposed to affect the prevalence of probability matching is the type of conceptual schema describing the events to be predicted: independent events or complementary events. Study 3 investigated the effects of schema type and abstraction level of context on matching or maximizing behavior. Three abstraction levels of stories were included: abstract, concrete random devices, and real-world stories. The main hypothesis was that when the two options to be predicted are independent events, less matching and more maximizing behavior should be observed. Data from Study 3 supported the hypothesis that independent events tend to elicit more maximizing behavior. No effects of abstraction level were observed.
Cognitive psychology, Quantitative psychology and psychometrics
Human Development, Measurement and Evaluation
Dissertations
Schematic Effects on Probability Problem Solving
http://academiccommons.columbia.edu/catalog/ac:174540
Gugga, Saranda Sonia
http://hdl.handle.net/10022/AC:P:20863
Fri, 28 Jun 2013 00:00:00 +0000
Three studies examined context effects on solving probability problems. Variants of word problems were written with cover stories which differed with respect to social or temporal schemas, while maintaining formal problem structure and solution procedure. In the first of these studies it was shown that problems depicting schemas in which randomness was inappropriate or unexpected for the social situation were solved less often than problems depicting schemas in which randomness was appropriate. Another set of two studies examined temporal and causal schemas, in which the convention is that events are considered in the forward direction. Pairs of conditional probability (CP) problems were written depicting events E1 and E2, such that E1 either occurs before E2 or causes E2. Problems were defined with respect to the order of events expressed in CPs, so that P(E2|E1) represents the CP in schema-consistent, intact order by considering the occurrence of E1 before E2, while P(E1|E2) represents the CP in schema-inconsistent, inverted order. Introductory statistics students had greater difficulty encoding CP for events in schema-inconsistent order than CP of events in conventional deterministic order. The differential effects of schematic context on solving probability problems identify specific conditions and sources of bias in human reasoning under uncertainty. In addition, these biases may be influential when evaluating empirical findings in a manner similar to that demonstrated in this paper experimentally, and may have implications for how social scientists are trained in research methodology.
Cognitive psychology, Quantitative psychology and psychometrics, Educational psychology
ssg34
Human Development, Measurement and Evaluation
Dissertations
Application of ordered latent class regression model in educational assessment
http://academiccommons.columbia.edu/catalog/ac:161911
Cha, Jisung
http://hdl.handle.net/10022/AC:P:20599
Thu, 06 Jun 2013 00:00:00 +0000
Latent class analysis is a useful tool to deal with discrete multivariate response data. Croon (1990) proposed the ordered latent class model where latent classes are ordered by imposing inequality constraints on the cumulative conditional response probabilities. Taking stochastic ordering of latent classes into account in the analysis of data gives a meaningful interpretation, since the primary purpose of a test is to order students on the latent trait continuum. This study extends Croon's model to ordered latent class regression that regresses latent class membership on covariates (e.g., gender, country) and demonstrates the utility of an ordered latent class regression model in educational assessment using data from the Trends in International Mathematics and Science Study (TIMSS). The benefit of this model is that item analysis and group comparisons can be done simultaneously in one model. The model is fitted by maximum likelihood estimation with an EM algorithm. It is found that the proposed model is a useful tool for exploratory purposes as a special case of nonparametric item response models, and that cross-country differences can be modeled as different compositions of discrete classes. Simulations are done to evaluate the performance of information criteria (AIC and BIC) in selecting the appropriate number of latent classes in the model. From the simulation results, AIC outperforms BIC for the model with the order-restricted maximum likelihood estimator.
Educational tests and measurements, Statistics, Mathematics education
jc2320
Human Development, Measurement and Evaluation
Dissertations
Penalized Joint Maximum Likelihood Estimation Applied to Two Parameter Logistic Item Response Models
http://academiccommons.columbia.edu/catalog/ac:161745
Paolino, Jon-Paul Noel
http://hdl.handle.net/10022/AC:P:20531
Fri, 31 May 2013 00:00:00 +0000
Item response theory (IRT) models are a conventional tool for analyzing both small scale and large scale educational data sets, and they are also used for the development of high-stakes tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Exam (GRE). When estimating these models it is imperative that the data set includes many more examinees than items, which is similar to the requirement in regression modeling that there be many more observations than variables. If this requirement has not been met, the analysis will yield meaningless results. Recently, penalized estimation methods have been developed to analyze data sets that may include more variables than observations. The main focus of this study was to apply LASSO and ridge regression penalization techniques to IRT models in order to better estimate model parameters. The results of our simulations showed that this new estimation procedure, called penalized joint maximum likelihood estimation, provided meaningful estimates when IRT data sets included more items than examinees, a setting in which traditional Bayesian estimation and marginal maximum likelihood methods were not appropriate. However, when the IRT data sets contained more examinees than items, Bayesian estimation clearly outperformed both penalized joint maximum likelihood estimation and marginal maximum likelihood.
Statistics
jnp2111
Human Development, Measurement and Evaluation
Dissertations
Examining Uncertainty and Misspecification of Attributes in Cognitive Diagnostic Models
http://academiccommons.columbia.edu/catalog/ac:174822
Chen, Chen-Miao Carol
http://hdl.handle.net/10022/AC:P:20451
Fri, 24 May 2013 00:00:00 +0000
In recent years, cognitive diagnostic models (CDMs) have been widely used in educational assessment to provide a diagnostic profile (mastery/non-mastery) analysis for examinees, which gives insights into learning and teaching. However, there is often uncertainty about the specification of the Q-matrix that is required for CDMs, given that it is based on expert judgment. The current study uses a Bayesian approach to examine recovery of Q-matrix elements in the presence of uncertainty about some elements. The first simulation examined the situation where there is complete uncertainty about whether or not an attribute is required, when in fact it is required. The simulation results showed that recovery was generally excellent. However, recovery broke down when other elements of the Q-matrix were misspecified. Further simulations showed that, if one has some information about the attributes for a few items, then recovery improves considerably, but this also depends on how many other elements are misspecified. A second set of simulations examined the situation where uncertain Q-matrix elements were scattered throughout the Q-matrix. Recovery was generally excellent, even when some other elements were misspecified. A third set of simulations showed that using more informative priors did not uniformly improve recovery. An application of the approach to data from TIMSS (2007) suggested some alternative Q-matrices.
Quantitative psychology and psychometrics
cc2410
Human Development, Measurement and Evaluation
Dissertations
Dealing with Sparse Rater Scoring of Constructed Responses within a Framework of a Latent Class Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:161491
Kim, Sunhee
http://hdl.handle.net/10022/AC:P:20440
Thu, 23 May 2013 00:00:00 +0000
In many assessment situations that use a constructed-response (CR) item, an examinee's response is evaluated by only one rater, which is called a single rater design. For example, in a classroom assessment practice, only one teacher grades each student's performance. While single rater designs are the most cost-effective method among all rater designs, the lack of a second rater causes difficulties with respect to how the scores should be used and evaluated. For example, one cannot assess rater reliability or rater effects when there is only one rater. The present study explores possible solutions for the issues that arise in sparse rater designs within the context of a latent class version of signal detection theory (LC-SDT) that has been previously used for rater scoring. This approach provides a model for rater cognition in CR scoring (DeCarlo, 2005; 2008; 2010) and offers measures of rater reliability and various rater effects. The following potential solutions to rater sparseness were examined: 1) the use of parameter restrictions to yield an identified model, 2) the use of informative priors in a Bayesian approach, and 3) the use of back readings (e.g., partially available second rater observations), which are available in some large scale assessments. Simulations and analyses of real-world data are conducted to examine the performance of these approaches. Simulation results showed that using parameter constraints allows one to detect various rater effects that are of concern in practice. The Bayesian approach also gave useful results, although estimation of some of the parameters was poor and the standard deviations of the parameter posteriors were large, except when the sample size was large. Using back-reading scores gave an identified model and simulations showed that the results were generally acceptable, in terms of parameter estimation, except for small sample sizes.
The paper also examines the utility of the approaches as applied to the PIRLS USA reliability data. The results show some similarities and differences between parameter estimates obtained with posterior mode estimation and with Bayesian estimation. Sensitivity analyses revealed that rater parameter estimates are sensitive to the specification of the priors, as also found in the simulation results with smaller sample sizes.
Educational tests and measurements
shk2125
Human Development, Measurement and Evaluation
Dissertations
An Item Response Theory Approach to Causal Inference in the Presence of a Pre-intervention Assessment
http://academiccommons.columbia.edu/catalog/ac:188469
Marini, Jessica
http://dx.doi.org/10.7916/D8WM1CR3
Thu, 16 May 2013 00:00:00 +0000
This research develops a form of causal inference based on Item Response Theory (IRT) to combat bias that occurs when existing causal inference methods are used under certain scenarios. When a pre-test is administered prior to a treatment decision, bias can occur in causal inferences about the decision's effect on the outcome. This new IRT-based method uses item-level information, treatment placement, and the outcome to produce estimates of each subject's ability in the chosen domain. Examining a causal inference research question in an IRT model-based framework becomes a model-based way to match subjects on estimates of their true ability. This model-based matching allows inferences to be made about a subject's performance as if they had been in the opposite treatment group. The IRT method is developed to address shortcomings of existing methods, such as relying on conditional independence between pre-test scores and outcomes. Using simulation, the IRT method is compared to existing methods under two different model scenarios in terms of Type I and Type II errors. Then the method's parameter recovery is analyzed, followed by the accuracy of treatment effect evaluation. The IRT method is shown to outperform existing methods in an ability-based scenario. Finally, the IRT method is applied to real data assessing the impact of advanced STEM in high school on a student's choice of major, and compared to existing alternative approaches.
Educational tests and measurements, Statistics
jpm2120
Human Development, Measurement and Evaluation
Dissertations
Examining the Impact of Examinee-Selected Constructed Response Items in the Context of a Hierarchical Rater Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:186227
Patterson, Brian Francis
http://dx.doi.org/10.7916/D8X929DC
Tue, 14 May 2013 15:36:13 +0000
Research into the relatively rarely used examinee-selected item assessment designs has revealed certain challenges. This study aims to more comprehensively re-examine the key issues around examinee-selected items under a modern model for constructed-response scoring. Specifically, data were simulated under the hierarchical rater model with signal detection theory rater components (HRM-SDT; DeCarlo, Kim, and Johnson, 2011) and a variety of examinee-item selection mechanisms were considered. These conditions varied from the hypothetical baseline condition--where examinees choose randomly and with equal frequency from a pair of item prompts--to the perhaps more realistic and certainly more troublesome condition where examinees select items based on the very subject-area proficiency that the instrument intends to measure. While good examinee, item, and rater parameter recovery was apparent in the former condition for the HRM-SDT, serious issues with item and rater parameter estimation were apparent in the latter. Additional conditions were considered, as well as competing psychometric models for the estimation of examinee proficiency. Finally, practical implications of using examinee-selected item designs are given, as well as future directions for research.
Educational tests and measurements
bfp2103
Measurement and Evaluation, Human Development
Dissertations
Bayesian Multidimensional Scaling Model for Ordinal Preference Data
http://academiccommons.columbia.edu/catalog/ac:161114
Matlosz, Kerry McCloskey
http://hdl.handle.net/10022/AC:P:20304
Tue, 14 May 2013 00:00:00 +0000
The model within the present study incorporated Bayesian Multidimensional Scaling and Markov Chain Monte Carlo methods to represent individual preferences and threshold parameters as they relate to the influence of survey items' popularity and their interrelationships. The model was used to interpret two independent data samples of ordinal consumer preference data related to purchasing behavior. The objective of the procedure was to provide an understanding and visual depiction of consumers' likelihood of having a strong affinity toward one of the survey choices, and how other survey choices relate to it. The study also aimed to derive the joint spatial representation of the subjects and products represented by the dissimilarity preference data matrix within a reduced dimensionality. This depiction would aim to enable interpretation of the preference structure underlying the data and potential demand for each product. Model simulations were created both by sampling from the normal distribution and by incorporating lambda values from the two data sets, and were analyzed separately. Posterior checks were used to determine dimensionality, which was also confirmed within the simulation procedures. The statistical properties generated from the simulated data confirmed that the true parameter values (loadings, utilities, and latitudes) were recovered. The model effectiveness was contrasted and evaluated both within real data samples and a simulated data set. The two data sets analyzed were confirmed to have differences in their underlying preference structures that resulted in differences in the optimal dimensionality in which the data should be represented.
The biases and MSEs of the lambdas and alphas provide further understanding of the data composition, and analysis of variance (ANOVA) confirmed that the differences in MSEs related to changes in dimensions were statistically significant.
Statistics
kmm2159
Human Development, Measurement and Evaluation
Dissertations
Nonlinear penalized estimation of true Q-matrix in cognitive diagnostic models
http://academiccommons.columbia.edu/catalog/ac:160812
Xiang, Rui
http://hdl.handle.net/10022/AC:P:20149
Wed, 01 May 2013 00:00:00 +0000
A key issue of cognitive diagnostic models (CDMs) is the correct identification of the Q-matrix, which indicates the relationship between attributes and test items. Previous CDMs typically assumed a known Q-matrix provided by domain experts such as those who developed the questions. However, misspecifications of the Q-matrix have been discovered in past studies. The primary purpose of this research is to set up a mathematical framework to estimate the true Q-matrix based on item response data. The model considers all Q-matrix elements as parameters and estimates them through the EM algorithm. Two simulation designs are conducted to evaluate the feasibility and performance of the model. An empirical study is conducted to compare the estimated Q-matrix with the one designed by experts. The results show that the model performs well and is able to identify 60% to 90% of the correct elements of the Q-matrix. The model also indicates possible misspecifications of the designed Q-matrix in the fraction subtraction test.
Statistics, Education, Psychology
rx2107
Human Development, Measurement and Evaluation
Dissertations
A Bayesian Multidimensional Scaling Model for Partial Rank Preference Data
http://academiccommons.columbia.edu/catalog/ac:160395
Tanaka, Kyoko
http://hdl.handle.net/10022/AC:P:20044
Tue, 30 Apr 2013 00:00:00 +0000
There has been great advancement in research on preferential choice in the field of marketing. When we look at preferential choice data, there are two components to consider: the individuals and the items. Coombs (1950; 1964) introduced the unfolding technique on preferential choice data. In 1960, Bennett and Hays went on to create a multidimensional unfolding model. Hojo (1997; 1998) showed rank data could be used in multidimensional scaling; however, he did not implement a Bayesian technique. In 2010, Fong, DeSarbo, Park, and Scott proposed a new Bayesian vector Multidimensional Scaling (MDS) model which was applied to data from a five-point Likert scale survey. This paper focused on a Bayesian choice behavior multidimensional space model for the analysis of partially ranked data (rank top 3 from J data) to provide a joint space of individuals and products, using an MCMC procedure. The procedure is similar to what Fong, DeSarbo, Park, and Scott (2010) did, but this study used partial rank data instead of Likert scale data. The goal of this study was to create a probability-based model that calculates the average product utility, which indicates how popular the product is. Lambdas, or the item loadings, are the direction of the products, and thetas are the direction for the individuals. In addition, this study dealt with rotational invariance by calculating the optimal lambda values for each iteration and each dimension by flipping the sign so it approaches the average value. To determine the number of dimensions of the datasets, the sum of squared loadings was calculated. We applied the MCMC procedure to simulated data in which we sampled the loadings from the normal distribution as well as loadings from the real datasets.
In addition, we applied the MCMC procedure to the real dataset and created a multidimensional space for the products.
Quantitative psychology and psychometrics
kjt2007
Human Development, Measurement and Evaluation
Dissertations
On the Use of Covariates in a Latent Class Signal Detection Model, with Applications to Constructed Response Scoring
http://academiccommons.columbia.edu/catalog/ac:146692
Wang, Zijian Gerald
http://hdl.handle.net/10022/AC:P:13156
Mon, 07 May 2012 00:00:00 +0000
A latent class signal detection (SDT) model was recently introduced as an alternative to traditional item response theory (IRT) methods in the analysis of constructed response data. This class of models can be represented as restricted latent class models and differs from the IRT approach in the way the latent construct is conceptualized. One appeal of the signal detection approach is that it provides an intuitive framework from which psychological processes governing rater behavior can be better understood. The present study developed an extension of the latent class SDT model to include covariates and examined the performance of the resulting model. Covariates can be incorporated into the latent class SDT model in three ways, to affect 1) latent class membership, 2) conditional response probabilities, or 3) both latent class membership and conditional response probabilities. In each case, simulations were conducted to investigate both parameter recovery and classification accuracy of the extended model under two competing rater designs; in addition, implications of ignoring covariate effects and covariate misspecification were explored. Here, the ability of information criteria, namely the AIC, small sample adjusted AIC and BIC, in recovering the true model with respect to how covariates are introduced was also examined. Results indicate that parameters were generally well recovered in fully-crossed designs; to obtain similar levels of estimation precision in incomplete designs, sample size requirements were comparatively higher and depended on the number of indicators used. When covariate effects were not accounted for or were misspecified, results show that parameter estimates tend to be severely biased, which in turn reduced classification accuracy. With respect to model recovery, the BIC performed the most consistently amongst the information criteria considered.
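For reference, the three information criteria named in this abstract can be computed as sketched below. The abstract does not give the formulas used in the dissertation, so this assumes the textbook definitions, with the common AICc correction standing in for the "small sample adjusted AIC"; the function names and the example numbers are hypothetical.

```python
import math


def aic(loglik: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * loglik


def aicc(loglik: float, k: int, n: int) -> float:
    """AIC with the usual small-sample correction (assumed form)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(loglik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion: k log(n) - 2 log L."""
    return k * math.log(n) - 2 * loglik


# Hypothetical fitted model: log-likelihood -100, 5 free parameters, 56 observations.
print(aic(-100.0, 5), aicc(-100.0, 5, 56), bic(-100.0, 5, 56))
```

For all three criteria lower values are preferred. BIC penalizes each parameter by log(n) rather than 2, so it favors smaller models once n exceeds about 7.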
In light of these findings, recommendations were made with regard to sample size requirements and model building strategies when implementing the extended latent class SDT model.
Educational tests and measurements
zgw2
Human Development, Measurement and Evaluation
Dissertations
The Relation between Uncertainty in Latent Class Membership and Outcomes in a Latent Class Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:146637
Cheng, Zhifen
http://hdl.handle.net/10022/AC:P:13139
Fri, 04 May 2012 00:00:00 +0000
Latent class variables are often used to predict outcomes. The conventional practice is to first assign observations to one of the latent classes based on the maximum posterior probabilities. The assigned class membership is then treated as an observed variable and used in predicting the outcomes. This widely used classify-analyze strategy ignores the uncertainty of being in a certain latent class for the observations. Once an observation is classified to the latent class with the highest posterior probability, its probability of being in the assigned class is treated as being one. In addition, once observations are classified to the latent class with the highest posterior probability, their representativeness of the class becomes the same because they will all have a probability of one of being in the assigned class. Finally, standard errors are underestimated because the residual uncertainty about the latent class membership is ignored. This dissertation used simulation studies and an analysis of a real-world data set to compare five commonly adopted approaches (most likely class regression, probability regression, probability-weighted regression, pseudo-class regression, and the simultaneous approach) for measuring the association between a latent class variable and outcome variables to see which one can better account for the uncertainty in latent class membership in such a situation. The model considered in the study was a latent class extension of the signal detection model (LC-SDT) by DeCarlo, which has proved to be able to address certain measurement issues in the educational field, more specifically, rater issues involved in essay grading such as rater effects and rater reliability. An LC-SDT model has the potential for wide applications in education as well as other areas.
Therefore it is important to explore the issue of accounting for uncertainty in latent class membership within this framework. Three ordinal outcome variables having a negative, weak, and strong association with the latent class variable were considered in the simulations. Results of the simulations showed that the simultaneous approach performed best in obtaining unbiased parameter estimates. It also yielded larger standard errors than the other approaches, which have been found by previous research to underestimate standard errors. Even though the simultaneous approach has its advantages, including outcome variables in a latent class model can affect parameters of the response variables. Therefore, caution is needed when using this approach. The analysis results of the real-world data set confirmed the trends observed in the simulation studies.
Quantitative psychology and psychometrics, Educational psychology, Statistics
zc2133
Human Development, Measurement and Evaluation
Dissertations
Rater Drift in Constructed Response Scoring via Latent Class Signal Detection Theory and Item Response Theory
http://academiccommons.columbia.edu/catalog/ac:132272
Park, Yoon Soo
http://hdl.handle.net/10022/AC:P:10394
Tue, 17 May 2011 00:00:00 +0000
The use of constructed response (CR) items or performance tasks to assess test takers' ability has grown tremendously over the past decade. Examples of CR items in psychological and educational measurement include essays, works of art, and admissions interviews. However, unlike multiple-choice (MC) items that have predetermined options, CR items require test takers to construct their own answer. As such, they require the judgment of multiple raters that are subject to differences in perception and prior knowledge of the material being evaluated. As with any scoring procedure, the scores assigned by raters must be comparable over time and over different test administrations and forms; in other words, scores must be reliable and valid for all test takers, regardless of when an individual takes the test. This study examines how longitudinal patterns or changes in rater behavior affect model-based classification accuracy. Rater drift refers to changes in rater behavior across different test administrations. Prior research has found evidence of drift. Rater behavior in CR scoring is examined using two measurement models - latent class signal detection theory (SDT) and item response theory (IRT) models. Rater effects (e.g., leniency and strictness) are partly examined with simulations, where the ability of different models to capture changes in rater behavior is studied. Drift is also examined in two real-world large scale tests: a teacher certification test and a high school writing test. These tests use the same set of raters for long periods of time, where each rater's scoring is examined on a monthly basis. Results from the empirical analysis showed that rater models were effective in detecting changes in rater behavior over testing administrations in real-world data. However, there were differences in rater discrimination between the latent class SDT and IRT models.
Simulations were used to examine the effect of rater drift on classification accuracy and on differences between the latent class SDT and IRT models. Changes in rater severity had only a minimal effect on classification. Rater discrimination had a greater effect on classification accuracy. This study also found that IRT models detected changes in rater severity and in rater discrimination even when data were generated from the latent class SDT model. However, when data were non-normal, IRT models underestimated rater discrimination, which may lead to incorrect inferences on the precision of raters. These findings provide new and important insights into CR scoring and issues that emerge in practice, including methods to improve rater training.
Quantitative psychology and psychometrics, Educational tests and measurements, Statistics
ysp2102
Human Development, National Center for Disaster Preparedness, Measurement and Evaluation
Dissertations