Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bsubject_facet%5D%5B%5D=Educational+tests+and+measurements&q=&rows=500&sort=record_creation_date+desc
Improving the Targeting of Treatment: Evidence from College Remediation
http://academiccommons.columbia.edu/catalog/ac:178925
Scott-Clayton, Judith E.; Crosta, Peter Michael; Belfield, Clive
http://dx.doi.org/10.7916/D8T15284
Wed, 22 Oct 2014 00:00:00 +0000
At an annual cost of roughly $7 billion nationally, remedial coursework is one of the single largest interventions intended to improve outcomes for underprepared college students. But like a costly medical treatment with non-trivial side effects, the value of remediation overall depends upon whether those most likely to benefit can be identified in advance. This NBER working paper uses administrative data and a rich predictive model to examine the accuracy of remedial screening tests, either instead of or in addition to using high school transcript data to determine remedial assignment. The authors find that roughly one in four test-takers in math and one in three test-takers in English are severely mis-assigned under current test-based policies, with mis-assignments to remediation much more common than mis-assignments to college-level coursework. Using high school transcript information, either instead of or in addition to test scores, could significantly reduce the prevalence of assignment errors. Further, the choice of screening device has significant implications for the racial and gender composition of both remedial and college-level courses. Finally, if institutions took account of students’ high school performance, they could remediate substantially fewer students without lowering success rates in college-level courses.
Higher education, Educational tests and measurements
js3676, pmc2107, cb2001
Economics and Education, Education Policy and Social Analysis, Community College Research Center, National Center for the Study of Privatization in Education
Working papers

Statistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
http://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephanie
http://dx.doi.org/10.7916/D8TQ5ZP5
Mon, 07 Jul 2014 00:00:00 +0000
There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis examines several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models. We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is a prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We first examine the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings. Depending on the area of application and the researcher's degree of control over the experiment design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models.
In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.
Statistics, Educational tests and measurements, Quantitative psychology and psychometrics
Statistics
Dissertations

Estimating the Q-matrix for Cognitive Diagnosis Models in a Bayesian Framework
http://academiccommons.columbia.edu/catalog/ac:176107
Chung, Meng-ta
http://dx.doi.org/10.7916/D857195B
Mon, 07 Jul 2014 00:00:00 +0000
This research aims to develop an MCMC algorithm for estimating the Q-matrix in a Bayesian framework. A saturated multinomial model was used to estimate correlated attributes in the DINA model and rRUM. Closed forms of the posteriors for guess and slip parameters were derived for the DINA model. The random walk Metropolis-Hastings algorithm was applied to parameter estimation in the rRUM. An algorithm for reducing potential label switching was incorporated into the estimation procedure. A method for simulating data with correlated attributes for the DINA model and rRUM was offered. Three simulation studies were conducted to evaluate the algorithm for Bayesian estimation. Twenty simulated data sets for simulation study 1 were generated from independent attributes for the DINA model and rRUM. A hundred data sets from correlated attributes were generated for the DINA model and rRUM with guess and slip parameters set to 0.2 in simulation study 2. Simulation study 3 analyzed data sets simulated from the DINA model with guess and slip parameters generated from Uniform(0.1, 0.4). Results from simulation studies showed that the Q-matrix recovery rate was satisfactory. Using the fraction-subtraction data, an empirical study was conducted for the DINA model and rRUM. The estimated Q-matrices from the two models were compared with the expert-designed Q-matrix.
Quantitative psychology and psychometrics, Statistics, Educational tests and measurements
Human Development, Measurement and Evaluation
Dissertations

Increasing Access to College-Level Math: Early Outcomes Using the Virginia Placement Test
http://academiccommons.columbia.edu/catalog/ac:175149
Rodríguez, Olga
http://dx.doi.org/10.7916/D8HQ3X1P
Fri, 27 Jun 2014 00:00:00 +0000
In spring 2012, the Virginia Community College System introduced a new math placement test, known as the Virginia Placement Test–Math (VPT). The system also implemented a new placement policy, with different math competencies required for the entry-level college math courses in liberal arts and STEM programs. This brief examines differences in students’ college math enrollment and completion rates before and after the introduction of the VPT and the new placement policy. After the VPT was implemented, a greater proportion of students placed into and enrolled in college-level math courses, and these higher enrollments boosted course completion rates for the cohort as a whole. However, pass rates among students who enrolled in entry-level math courses declined modestly. These findings highlight a tradeoff that should be acknowledged when planning reforms to reduce remedial placement rates using a placement instrument. Changes to how academic supports are deployed and changes to teaching and learning strategies used in college math courses could improve conditional pass rates over time.
Community college education, Educational tests and measurements, Education policy
or2125
Institute on Education and the Economy, Community College Research Center
Reports

Application of ordered latent class regression model in educational assessment
http://academiccommons.columbia.edu/catalog/ac:161911
Cha, Jisung
http://hdl.handle.net/10022/AC:P:20599
Thu, 06 Jun 2013 00:00:00 +0000
Latent class analysis is a useful tool to deal with discrete multivariate response data. Croon (1990) proposed the ordered latent class model where latent classes are ordered by imposing inequality constraints on the cumulative conditional response probabilities. Taking stochastic ordering of latent classes into account in the analysis of data gives a meaningful interpretation, since the primary purpose of a test is to order students on the latent trait continuum. This study extends Croon's model to ordered latent class regression that regresses latent class membership on covariates (e.g., gender, country) and demonstrates the utility of an ordered latent class regression model in educational assessment using data from the Trends in International Mathematics and Science Study (TIMSS). The benefit of this model is that item analysis and group comparisons can be done simultaneously in one model. The model is fitted by maximum likelihood estimation with an EM algorithm. It is found that the proposed model is a useful tool for exploratory purposes as a special case of nonparametric item response models and that cross-country differences can be modeled as different compositions of discrete classes. Simulations are done to evaluate the performance of information criteria (AIC and BIC) in selecting the appropriate number of latent classes in the model. From the simulation results, AIC outperforms BIC for the model with the order-restricted maximum likelihood estimator.
Educational tests and measurements, Statistics, Mathematics education
jc2320
Human Development, Measurement and Evaluation
Dissertations

Dealing with Sparse Rater Scoring of Constructed Responses within a Framework of a Latent Class Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:161491
Kim, Sunhee
http://hdl.handle.net/10022/AC:P:20440
Thu, 23 May 2013 00:00:00 +0000
In many assessment situations that use a constructed-response (CR) item, an examinee's response is evaluated by only one rater, which is called a single rater design. For example, in a classroom assessment practice, only one teacher grades each student's performance. While single rater designs are the most cost-effective method among all rater designs, the lack of a second rater causes difficulties with respect to how the scores should be used and evaluated. For example, one cannot assess rater reliability or rater effects when there is only one rater. The present study explores possible solutions for the issues that arise in sparse rater designs within the context of a latent class version of signal detection theory (LC-SDT) that has been previously used for rater scoring. This approach provides a model for rater cognition in CR scoring (DeCarlo, 2005; 2008; 2010) and offers measures of rater reliability and various rater effects. The following potential solutions to rater sparseness were examined: 1) the use of parameter restrictions to yield an identified model, 2) the use of informative priors in a Bayesian approach, and 3) the use of back readings (e.g., partially available 2nd rater observations), which are available in some large scale assessments. Simulations and analyses of real-world data are conducted to examine the performance of these approaches. Simulation results showed that using parameter constraints allows one to detect various rater effects that are of concern in practice. The Bayesian approach also gave useful results, although estimation of some of the parameters was poor and the standard deviations of the parameter posteriors were large, except when the sample size was large. Using back-reading scores gave an identified model and simulations showed that the results were generally acceptable, in terms of parameter estimation, except for small sample sizes.
The paper also examines the utility of the approaches as applicable to the PIRLS USA reliability data. The results show some similarities and differences between parameter estimates obtained with posterior mode estimation and with Bayesian estimation. Sensitivity analyses revealed that rater parameter estimates are sensitive to the specification of the priors, as also found in the simulation results with smaller sample sizes.
Educational tests and measurements
shk2125
Human Development, Measurement and Evaluation
Dissertations

Examining the Impact of Examinee-Selected Constructed Response Items in the Context of a Hierarchical Rater Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:186227
Patterson, Brian Francis
http://dx.doi.org/10.7916/D8X929DC
Tue, 14 May 2013 15:36:13 +0000
Research into the relatively rarely used examinee-selected item assessment designs has revealed certain challenges. This study aims to more comprehensively re-examine the key issues around examinee-selected items under a modern model for constructed-response scoring. Specifically, data were simulated under the hierarchical rater model with signal detection theory rater components (HRM-SDT; DeCarlo, Kim, and Johnson, 2011) and a variety of examinee-item selection mechanisms were considered. These conditions varied from the hypothetical baseline condition, where examinees choose randomly and with equal frequency from a pair of item prompts, to the perhaps more realistic and certainly more troublesome condition where examinees select items based on the very subject-area proficiency that the instrument intends to measure. While good examinee, item, and rater parameter recovery was apparent in the former condition for the HRM-SDT, serious issues with item and rater parameter estimation were apparent in the latter. Additional conditions were considered, as well as competing psychometric models for the estimation of examinee proficiency. Finally, practical implications of using examinee-selected item designs are given, as well as future directions for research.
Educational tests and measurements
bfp2103
Measurement and Evaluation, Human Development
Dissertations

Statistical Inference for Diagnostic Classification Models
http://academiccommons.columbia.edu/catalog/ac:160464
Xu, Gongjun
http://hdl.handle.net/10022/AC:P:20058
Tue, 30 Apr 2013 00:00:00 +0000
Diagnostic classification models (DCM) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he/she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers to the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be further extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, this problem can be formulated as a pure significance testing problem with null hypothesis H0: Q = Q0, where Q0 is the candidate Q-matrix.
We propose a test statistic that measures the consistency of observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied. In addition, we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.
Statistics, Educational tests and measurements
gx2108
Statistics
Dissertations

Improving Developmental Education Assessment and Placement: Lessons From Community Colleges Across the Country
http://academiccommons.columbia.edu/catalog/ac:157295
Hodara, Michelle; Jaggars, Shanna; Karp, Melinda Jane Mechur
http://hdl.handle.net/10022/AC:P:19262
Wed, 06 Mar 2013 00:00:00 +0000
At open-access two-year public colleges, the goal of the traditional assessment and placement process is to match incoming students to the developmental or college-level courses for which they have adequate preparation; the process presumably increases underprepared students’ chances of short- and long-term success in college while maintaining the academic quality and rigor of college-level courses. However, the traditional process may be limited in its ability to achieve these aims due to poor course placement accuracy and inconsistent standards of college readiness. To understand current approaches that seek to improve the process, we conducted a scan of assessment and placement policies and practices at open-access two-year colleges in Georgia, New Jersey, North Carolina, Oregon, Texas, Virginia, and Wisconsin. We describe the variety of approaches that systems and colleges employed to ameliorate poor course placement accuracy and inconsistent standards associated with the traditional process. Taking a broad view of the extent of these approaches, we find that most colleges we studied adopted a measured approach that addressed a single limitation without attending to other limitations that contribute to the same overall problem of poor course placement accuracy or inconsistent standards. Much less common were comprehensive approaches that attended to multiple limitations of the process; these approaches were likely to result from changes to developmental education as a whole.
Drawing from the study’s findings, we also discuss how colleges can overcome barriers to reform in order to implement approaches that hold promise for improved course placement accuracy, more consistent standards of college readiness, and, potentially, greater long-term academic success of community college students.
Community college education, Educational tests and measurements
meh70, sj2391, mjm305
Economics and Education, Institute on Education and the Economy, Community College Research Center
Working papers

Assessing Developmental Assessment in Community Colleges
http://academiccommons.columbia.edu/catalog/ac:146946
Hughes, Katherine Lee; Scott-Clayton, Judith E.
http://hdl.handle.net/10022/AC:P:13231
Thu, 17 May 2012 00:00:00 +0000
Placement exams are high-stakes assessments that determine many students' college trajectories. The majority of community colleges use placement exams (most often the ACCUPLACER, developed by the College Board, or the COMPASS, developed by ACT, Inc.) to sort students into college-level or developmental education courses in math, reading, and sometimes writing. More than half of entering students at community colleges are placed into developmental education in at least one subject as a result. But the evidence on the predictive validity of these tests is not as strong as many might assume, given the stakes involved, and recent research fails to find evidence that the resulting placements into remediation improve student outcomes. While this has spurred debate about the content and delivery of remedial coursework, it is possible that the assessment process itself may be broken; the debate about remediation policy is incomplete without a fuller understanding of the role of assessment. This Brief examines the role of developmental assessment, the validity of the most common assessments currently in use, and emerging directions in assessment policy and practice. Alternative methods of assessment, particularly those involving multiple measures of student preparedness, seem to have the potential to improve student outcomes, but more research is needed to determine what type of change in assessment and placement policy might improve persistence and graduation rates. The Brief concludes with a discussion of implications for policy and research.
Community college education, Educational tests and measurements
kh154, js3676
International and Transcultural Studies, Institute on Education and the Economy, Community College Research Center
Reports

On the Use of Covariates in a Latent Class Signal Detection Model, with Applications to Constructed Response Scoring
http://academiccommons.columbia.edu/catalog/ac:146692
Wang, Zijian Gerald
http://hdl.handle.net/10022/AC:P:13156
Mon, 07 May 2012 00:00:00 +0000
A latent class signal detection (SDT) model was recently introduced as an alternative to traditional item response theory (IRT) methods in the analysis of constructed response data. This class of models can be represented as restricted latent class models and differs from the IRT approach in the way the latent construct is conceptualized. One appeal of the signal detection approach is that it provides an intuitive framework from which psychological processes governing rater behavior can be better understood. The present study developed an extension of the latent class SDT model to include covariates and examined the performance of the resulting model. Covariates can be incorporated into the latent class SDT model in three ways: to affect 1) latent class membership, 2) conditional response probabilities, or 3) both latent class membership and conditional response probabilities. In each case, simulations were conducted to investigate both parameter recovery and classification accuracy of the extended model under two competing rater designs; in addition, implications of ignoring covariate effects and covariate misspecification were explored. Here, the ability of information criteria, namely the AIC, small sample adjusted AIC and BIC, in recovering the true model with respect to how covariates are introduced was also examined. Results indicate that parameters were generally well recovered in fully-crossed designs; to obtain similar levels of estimation precision in incomplete designs, sample size requirements were comparatively higher and depended on the number of indicators used. When covariate effects were not accounted for or misspecified, results showed that parameter estimates tended to be severely biased, which in turn reduced classification accuracy. With respect to model recovery, the BIC performed the most consistently amongst the information criteria considered.
In light of these findings, recommendations were made with regard to sample size requirements and model building strategies when implementing the extended latent class SDT model.
Educational tests and measurements
zgw2
Human Development, Measurement and Evaluation
Dissertations

Assessing Developmental Assessment in Community Colleges
http://academiccommons.columbia.edu/catalog/ac:146649
Hughes, Katherine Lee; Scott-Clayton, Judith E.
http://hdl.handle.net/10022/AC:P:13143
Fri, 04 May 2012 00:00:00 +0000
Placement exams are high-stakes assessments that determine many students' college trajectories. The majority of community colleges use placement exams (most often the ACCUPLACER, developed by the College Board, or the COMPASS, developed by ACT, Inc.) to sort students into college-level or developmental education courses in math, reading, and sometimes writing. More than half of entering students at community colleges are placed into developmental education in at least one subject as a result. But the evidence on the predictive validity of these tests is not as strong as many might assume, given the stakes involved, and recent research fails to find evidence that the resulting placements into remediation improve student outcomes. While this has spurred debate about the content and delivery of remedial coursework, it is possible that the assessment process itself may be broken; the debate about remediation policy is incomplete without a fuller understanding of the role of assessment. This paper examines the extent of consensus regarding the role of developmental assessment and how it is best implemented, the validity of the most common assessments currently in use, and emerging directions in assessment policy and practice. Alternative methods of assessment, particularly those involving multiple measures of student preparedness, seem to have the potential to improve student outcomes, but more research is needed to determine what type of change in assessment and placement policy might improve persistence and graduation rates.
The paper concludes with a discussion of gaps in the literature and implications for policy and research.
Community college education, Educational tests and measurements
kh154, js3676
International and Transcultural Studies, Institute on Education and the Economy, Community College Research Center
Working papers

Do High-Stakes Placement Exams Predict College Success?
http://academiccommons.columbia.edu/catalog/ac:146482
Scott-Clayton, Judith E.
http://hdl.handle.net/10022/AC:P:13085
Wed, 02 May 2012 00:00:00 +0000
Community colleges are typically assumed to be nonselective, open-access institutions. Yet access to college-level courses at such institutions is far from guaranteed: the vast majority of two-year institutions administer high-stakes exams to entering students that determine their placement into either college-level or remedial education. Despite the stakes involved, there has been relatively little research investigating whether such exams are valid for their intended purpose, or whether other measures of preparedness might be equally or even more effective. This paper contributes to the literature by analyzing the predictive validity of one of the most commonly used assessments, using data on over 42,000 first-time entrants to a large, urban community college system. Using both traditional correlation coefficients as well as more useful decision-theoretic measures of placement accuracy and error rates, I find that placement exams are more predictive of success in math than in English, and more predictive of who is likely to do well in college-level coursework than of who is likely to fail. Utilizing multiple measures to make placement decisions could reduce severe misplacements by about 15 percent without changing the remediation rate, or could reduce the remediation rate by 8 to 12 percentage points while maintaining or increasing success rates in college-level courses. Implications and limitations are discussed.
Community college education, Educational tests and measurements
js3676
International and Transcultural Studies, Community College Research Center
Working papers

Predicting Success in College: The Importance of Placement Tests and High School Transcripts
http://academiccommons.columbia.edu/catalog/ac:146486
Belfield, Clive; Crosta, Peter Michael
http://hdl.handle.net/10022/AC:P:13086
Wed, 02 May 2012 00:00:00 +0000
This paper uses student-level data from a statewide community college system to examine the validity of placement tests and high school information in predicting course grades and college performance. It considers the ACCUPLACER and COMPASS placement tests, using two quantitative and two literacy tests from each battery. The authors find that placement tests do not yield strong predictions of how students will perform in college. Placement test scores are positively but weakly associated with college grade point average (GPA). The correlation disappears when high school GPA is controlled for. Placement test scores are positively associated with college credit accumulation even after controlling for high school GPA. After three to five semesters, a student with a placement test score in the highest quartile has on average nine credits more than a student with a placement test score in the lowest quartile. In contrast, high school GPAs are useful for predicting many aspects of students' college performance. High school GPA has a strong association with college GPA; students' college GPAs are approximately 0.6 units below their high school GPAs. High school GPA also has a strong association with college credit accumulation. A student whose high school GPA is one grade higher will accumulate approximately four extra credits per semester. Other information from high school transcripts is modestly useful; this includes number of math and English courses taken in high school, honors courses, number of F grades, and number of credits. This high school information is not independently useful beyond high school GPA, and collectively it explains less variation in college performance. The authors also calculate accuracy rates and four validity metrics for placement tests. They find high "severe" error rates using the placement test cutoffs.
The severe error rate for English is 27 to 33 percent; i.e., three out of every ten students are severely misassigned. For math, the severe error rates are lower but still nontrivial. Using high school GPA instead of placement tests reduces the severe error rates by half across both English and math.
Community college education, Educational tests and measurements
cb2505, pmc2107
Economics and Education, Institute on Education and the Economy, Community College Research Center, National Center for the Study of Privatization in Education
Working papers

Rater Drift in Constructed Response Scoring via Latent Class Signal Detection Theory and Item Response Theory
http://academiccommons.columbia.edu/catalog/ac:132272
Park, Yoon Soo
http://hdl.handle.net/10022/AC:P:10394
Tue, 17 May 2011 00:00:00 +0000
The use of constructed response (CR) items or performance tasks to assess test takers' ability has grown tremendously over the past decade. Examples of CR items in psychological and educational measurement range from essays to works of art and admissions interviews. However, unlike multiple-choice (MC) items that have predetermined options, CR items require test takers to construct their own answer. As such, they require the judgment of multiple raters who are subject to differences in perception and prior knowledge of the material being evaluated. As with any scoring procedure, the scores assigned by raters must be comparable over time and over different test administrations and forms; in other words, scores must be reliable and valid for all test takers, regardless of when an individual takes the test. This study examines how longitudinal patterns or changes in rater behavior affect model-based classification accuracy. Rater drift refers to changes in rater behavior across different test administrations. Prior research has found evidence of drift. Rater behavior in CR scoring is examined using two measurement models: latent class signal detection theory (SDT) and item response theory (IRT) models. Rater effects (e.g., leniency and strictness) are partly examined with simulations, where the ability of different models to capture changes in rater behavior is studied. Drift is also examined in two real-world large scale tests: a teacher certification test and a high school writing test. These tests use the same set of raters for long periods of time, where each rater's scoring is examined on a monthly basis. Results from the empirical analysis showed that rater models were effective in detecting changes in rater behavior over testing administrations in real-world data. However, there were differences in rater discrimination between the latent class SDT and IRT models.
Simulations were used to examine the effect of rater drift on classification accuracy and on differences between the latent class SDT and IRT models. Changes in rater severity had only a minimal effect on classification. Rater discrimination had a greater effect on classification accuracy. This study also found that IRT models detected changes in rater severity and in rater discrimination even when data were generated from the latent class SDT model. However, when data were non-normal, IRT models underestimated rater discrimination, which may lead to incorrect inferences on the precision of raters. These findings provide new and important insights into CR scoring and issues that emerge in practice, including methods to improve rater training.
Quantitative psychology and psychometrics, Educational tests and measurements, Statistics
ysp2102
Human Development, National Center for Disaster Preparedness, Measurement and Evaluation
Dissertations

The central role of noise in evaluating interventions that use test scores to rank schools
http://academiccommons.columbia.edu/catalog/ac:115957
Chay, Kenneth; McEwan, Patrick J.; Urquiola, Miguel S.
http://hdl.handle.net/10022/AC:P:493
Thu, 24 Mar 2011 00:00:00 +0000
Several countries have implemented programs that use test scores to rank schools, and to reward or penalize them based on their students' average performance. Recently, Kane and Staiger (2002) have warned that imprecision in the measurement of school-level test scores could impede these efforts. There is little evidence, however, on how seriously noise hinders the evaluation of the impact of these interventions. We examine these issues in the context of Chile's P-900 program, a country-wide intervention in which resources were allocated based on cutoffs in schools' mean test scores. We show that transitory noise in average scores and mean reversion lead conventional estimation approaches to greatly overstate the impacts of such programs. We then show how a regression discontinuity design that utilizes the discrete nature of the selection rule can be used to control for reversion biases. While the RD analysis provides convincing evidence that the P-900 program had significant effects on test score gains, these effects are much smaller than is widely believed.
Educational tests and measurements
msu2101
Economics, International and Public Affairs
Working papers