Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bsubject_facet%5D%5B%5D=Educational+tests+and+measurements&q=&rows=500&sort=record_creation_date+desc
Estimation of Q-matrix for DINA Model Using the Constrained Generalized DINA Framework
http://academiccommons.columbia.edu/catalog/ac:198654
Li, Huacheng
http://dx.doi.org/10.7916/D88W3DB2
Thu, 05 May 2016 21:18:24 +0000
Research on cognitive diagnostic models (CDMs) is becoming an important field of psychometrics. Instead of assigning one score, CDMs provide attribute profiles that indicate examinees' mastery status for concepts or skills, which makes test results more informative. The implementation of many CDMs relies on an existing item-to-attribute relationship, which means that we need to know the concepts or skills each item requires. The relationships between items and attributes can be summarized in the Q-matrix. Misspecification of the Q-matrix leads to incorrect attribute profiles. The Q-matrix can be designed by expert judgement, but such practice can be subjective. Previous research has addressed Q-matrix estimation. This study proposes an estimation method for one of the most parsimonious CDMs, the DINA model. The method estimates the Q-matrix for the DINA model by setting constraints on the generalized DINA model. In the simulation study, the results showed that the estimated Q-matrix fit the empirical fraction subtraction data better than the expert-designed Q-matrix. We also show that the proposed method may still be applicable when the constraints are relaxed.
Educational tests and measurements, Statistics
hl2536
Measurement and Evaluation, Human Development
Dissertations
Exploring Skill Condensation Rules for Cognitive Diagnostic Models in a Bayesian Framework
http://academiccommons.columbia.edu/catalog/ac:193918
Luna Bazaldua, Diego A.
http://dx.doi.org/10.7916/D8NP247C
Wed, 27 Jan 2016 23:19:12 +0000
Diagnostic paradigms are becoming an alternative to normative approaches in educational assessment. One of the principal objectives of diagnostic assessment is to determine skill proficiency for tasks that demand the use of specific cognitive processes. Ideally, diagnostic assessments should include accurate information about the skills required to correctly answer each item in a test, as well as any additional evidence about the interaction between those cognitive constructs. Nevertheless, little research in the field has focused on the types of interactions (i.e., the condensation rules) among skills in models for cognitive diagnosis.
The present study introduces a Bayesian approach to determine the underlying interaction among the skills measured by a given item when comparing among models with conjunctive, disjunctive, and compensatory condensation rules. Following the reparameterization framework proposed by DeCarlo (2011), the present study includes transformations for disjunctive and compensatory models. Next, a methodology that compares between pairs of models with different condensation rules is presented; parameters in the model and their distribution were defined considering former Bayesian approaches proposed in the literature.
Simulation studies and empirical studies were performed to test the capacity of the model to correctly identify the underlying condensation rule. Overall, results from the simulation study showed that the underlying condensation rule is correctly identified across conditions. The results also showed that correct identification of the condensation rule depends on the item parameter values used to generate the data and on the use of informative prior distributions for the model parameters. Latent class size parameters for the skills and their respective hyperparameters also showed good recovery in the simulation study. The recovery of the item parameters presented limitations, so some guidelines to improve their estimation are presented in the results and discussion sections.
The empirical studies highlighted the usefulness of this approach in determining the interaction among skills using real items from a mathematics test and a language test. Despite the differences in their area of knowledge and Q-matrix structure, results indicated that both tests contain a higher proportion of conjunctive items, which demand the mastery of all skills.
Educational tests and measurements, Quantitative psychology, Educational evaluation
dal2159
Measurement and Evaluation
Dissertations
How Teaching Matters
http://academiccommons.columbia.edu/catalog/ac:193132
Silverstein, Samuel C.
http://dx.doi.org/10.7916/D8SN08RW
Wed, 20 Jan 2016 12:05:38 +0000
Silverstein writes a letter to the editor of CBE Life Science Education in response to Harold Wenglinsky's paper “How Teaching Matters: Bringing the Classroom Back into Discussions of Teacher Quality” (www.ets.org/research/pic), which analyzes how the attributes and classroom practices of teachers affect the performance of eighth-grade students on standardized tests in science. Wenglinsky identified teacher attributes and practices that are highly correlated with superior student achievement. Two of the four attributes and practices identified by Wenglinsky (i.e., teacher laboratory skills and implementation of hands-on classroom exercises) are the central focus of Columbia University's Summer Research Program for Science Teachers (www.scienceteacherprogram.org) and of other Science Work Experience Programs for Teachers. Data to be published elsewhere show that teacher participation in Columbia's program has a very positive impact on their students' success in passing a New York State Regents exam in science. Confirmation of Wenglinsky's postulates could greatly simplify the task of improving middle and high school science education. For this reason alone, it is important to test Wenglinsky's conclusions rigorously and soon.
Educational tests and measurements, Middle school education, Science education
scs3
Physiology and Cellular Biophysics
Articles
Use of Mnemonics in Learning Novel Foreign Vocabulary: Help or Hindrance?
http://academiccommons.columbia.edu/catalog/ac:195677
Liu, Yeu-Ting
http://dx.doi.org/10.7916/D80P0ZGS
Thu, 08 Oct 2015 17:33:59 +0000
Research has consistently indicated that the use of mnemonic devices yields substantially higher levels of retention in immediate recall of second language vocabulary words in comparison with other learning strategies. However, the evidence does not explain why the immediate benefits of mnemonic devices fail to extend to long-term retention. In addition, research on mnemonics has drawn mostly on the assessment of phonetic-hospitable languages such as English (as opposed to image-hospitable languages such as Chinese). To examine the use of mnemonic devices more thoroughly, this review will draw on psychological research on memory to discuss the efficacy of mnemonic methods, as opposed to rote rehearsal, in learning vocabulary in phonetic- and image-hospitable languages.
Education, Educational technology, Educational tests and measurements, English as a second language, Foreign language instruction, Language
Applied Linguistics and Teaching English to Speakers of Other Languages
Articles
Analyzing the Longitudinal K-12 Grading Histories of Entire Cohorts of Students: Grades, Data Driven Decision Making, Dropping Out and Hierarchical Cluster Analysis
http://academiccommons.columbia.edu/catalog/ac:188651
Bowers, Alex J.
http://dx.doi.org/10.7916/D8QC02TX
Tue, 22 Sep 2015 13:12:10 +0000
School personnel currently lack an effective method to pattern and visually interpret disaggregated achievement data collected on students as a means to help inform decision making. This study, through the examination of longitudinal K-12 teacher assigned grading histories for entire cohorts of students from a school district (n=188), demonstrates a novel application of hierarchical cluster analysis and pattern visualization in which all data points collected on every student in a cohort can be patterned, visualized and interpreted to aid in data driven decision making by teachers and administrators. Additionally, as a proof-of-concept study, overall schooling outcomes, such as student dropout or taking a college entrance exam, are identified from the data patterns and compared to past methods of dropout identification as one example of the usefulness of the method. Hierarchical cluster analysis correctly identified over 80% of the students who dropped out using the entire student grade history patterns from either K-12 or K-8.
Educational leadership, Elementary education, Secondary education, Educational tests and measurements
ab3764
Education Leadership
Articles
The Politics of International Large-Scale Assessment: The Programme for International Student Assessment (PISA) and American Education Discourse, 2000-2012
http://academiccommons.columbia.edu/catalog/ac:187929
Green Saraisky, Nancy
http://dx.doi.org/10.7916/D8DB80WW
Tue, 12 May 2015 18:30:28 +0000
The number of countries participating in large-scale international assessments has grown dramatically during the past two decades and the use of assessment results in national-level education policy debate has increased commensurately. Recent literature on the role of international assessments in education politics suggests that rankings and performance indicators can shape national educational discourse in important ways. This dissertation examines the use of one such assessment, the Programme for International Student Assessment (PISA), in education discourse in the United States from 2000 to 2012. The United States played a key role in the development of PISA and has participated in almost every international assessment of the past fifty years. Yet scholars have mostly overlooked the reception of international assessment in the United States. This dissertation seeks to address this gap.
Using an original dataset of one hundred and thirty texts from American academic literature, think tanks and the media, I examine the use of references to PISA and to top scoring countries on PISA, e.g., Finland and China (Shanghai), during the first decade of PISA testing. I find that PISA has rapidly become an accepted comparative measure of educational excellence throughout US discourse. However, despite consistently middling American scores, attempts to turn America’s PISA performance into a crisis of the US education system have not stuck. Instead, I suggest that both global and domestic politics play a stronger role in shaping the interpretations of student achievement on PISA than does student performance. I show how the American PISA discourse: (1) is driven by political, not empirical, realities; (2) contains few calls for policy borrowing from top-scoring countries and has not engendered any direct efforts at policy reform; (3) is framed with remarkable consistency across the political spectrum; and (4) is a profoundly elite enterprise, privileging the voices of international organizations and policy makers over those of parents, teachers and students.
Education, Political science, Educational tests and measurements
nlg2004
Comparative and International Education
Dissertations
Improving the Targeting of Treatment: Evidence from College Remediation
http://academiccommons.columbia.edu/catalog/ac:178925
Scott-Clayton, Judith E.; Crosta, Peter Michael; Belfield, Clive
http://dx.doi.org/10.7916/D8T15284
Wed, 22 Oct 2014 13:49:32 +0000
At an annual cost of roughly $7 billion nationally, remedial coursework is one of the single largest interventions intended to improve outcomes for underprepared college students. But like a costly medical treatment with non-trivial side effects, the value of remediation overall depends upon whether those most likely to benefit can be identified in advance. This NBER working paper uses administrative data and a rich predictive model to examine the accuracy of remedial screening tests, either instead of or in addition to using high school transcript data to determine remedial assignment.
The authors find that roughly one in four test-takers in math and one in three test-takers in English are severely mis-assigned under current test-based policies, with mis-assignments to remediation much more common than mis-assignments to college-level coursework. Using high school transcript information, either instead of or in addition to test scores, could significantly reduce the prevalence of assignment errors. Further, the choice of screening device has significant implications for the racial and gender composition of both remedial and college-level courses. Finally, if institutions took account of students’ high school performance, they could remediate substantially fewer students without lowering success rates in college-level courses.
Higher education, Educational tests and measurements
js3676, pmc2107, cb2001
Education Policy and Social Analysis, Economics and Education, National Center for the Study of Privatization in Education, Community College Research Center
Working papers
An item response theory approach to longitudinal analysis with application to summer setback in preschool language/literacy
http://academiccommons.columbia.edu/catalog/ac:198809
Kim, Sunhee; Camilli, Gregory
http://dx.doi.org/10.7916/D8WS8RR1
Tue, 23 Sep 2014 04:58:27 +0000
Background: As the popularity of classroom observations has increased, they have been implemented in many longitudinal studies with large probability samples. Given the complexity of longitudinal measurements, there is a need for tools to investigate both growth and the properties of the measurement scale. Methods: A practical IRT model with an embedded growth model is illustrated to examine the psychometric characteristics of classroom assessments for preschool children, and also to show how nonlinear learning over time can be investigated. This approach is applied to data collected for the Academic Rating Scale (ARS) in the literacy domain, which was administered on four occasions over two years. Results: The model enabled an effective illustration of overall and individual gains over two academic years. In particular, a significant deceleration in latent literacy skills during summer was observed. The results also provided psychometric support for the argument that ARS literacy can be used to assess developmental skill levels consistent with theories of early literacy acquisition. Conclusions: The proposed IRT approach provided growth parameters that are estimated directly, rather than obtaining these coefficients from estimated growth scores, which may result in biased and inconsistent estimates of growth parameters. The model is also capable of simultaneously representing parameters of items and persons.
Educational evaluation, Educational tests and measurements
shk2125
Medicine
Articles
Statistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
http://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephanie
http://dx.doi.org/10.7916/D8TQ5ZP5
Mon, 07 Jul 2014 11:50:59 +0000
There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis examines several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models.
We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is a prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We examine the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings.
Depending on the area of application and the researcher's degree of control over the experimental design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models. In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.
Statistics, Educational tests and measurements, Quantitative psychology and psychometrics
Statistics
Dissertations
Estimating the Q-matrix for Cognitive Diagnosis Models in a Bayesian Framework
http://academiccommons.columbia.edu/catalog/ac:176107
Chung, Meng-ta
http://dx.doi.org/10.7916/D857195B
Mon, 07 Jul 2014 11:49:03 +0000
This research aims to develop an MCMC algorithm for estimating the Q-matrix in a Bayesian framework. A saturated multinomial model was used to estimate correlated attributes in the DINA model and rRUM. Closed-form posteriors for guess and slip parameters were derived for the DINA model. The random walk Metropolis-Hastings algorithm was applied to parameter estimation in the rRUM. An algorithm for reducing potential label switching was incorporated into the estimation procedure. A method for simulating data with correlated attributes for the DINA model and rRUM was offered.
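For readers unfamiliar with the roles of the Q-matrix and the guess and slip parameters, the following is a minimal sketch of the standard DINA item response function. It is not taken from the dissertation; the function name `dina_prob` and its argument names are illustrative assumptions, and the 0.2 values in the usage note simply mirror the guess and slip setting mentioned for simulation study 2.

```python
def dina_prob(alpha, q_matrix, guess, slip):
    """Return P(correct) for every examinee-item pair under the DINA model.

    alpha:    list of 0/1 attribute profiles, one list per examinee
    q_matrix: list of 0/1 rows, one per item (1 = item requires the attribute)
    guess:    per-item probability of answering correctly without full mastery
    slip:     per-item probability of answering incorrectly despite full mastery
    (All names here are hypothetical, chosen for illustration only.)
    """
    probs = []
    for profile in alpha:
        row = []
        for j, required in enumerate(q_matrix):
            # Conjunctive rule: the examinee must master every required attribute.
            mastered_all = all(a >= q for a, q in zip(profile, required))
            row.append(1.0 - slip[j] if mastered_all else guess[j])
        probs.append(row)
    return probs
```

With two items, where the first requires attribute 1 and the second requires both attributes, calling `dina_prob([[1, 0], [1, 1]], [[1, 0], [1, 1]], [0.2, 0.2], [0.2, 0.2])` gives the first examinee a high success probability only on the first item, while the second examinee has a high success probability on both.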
Three simulation studies were conducted to evaluate the algorithm for Bayesian estimation. Twenty simulated data sets for simulation study 1 were generated from independent attributes for the DINA model and rRUM. A hundred data sets from correlated attributes were generated for the DINA and rRUM with guess and slip parameters set to 0.2 in simulation study 2. Simulation study 3 analyzed data sets simulated from the DINA model with guess and slip parameters generated from Uniform(0.1, 0.4). Results from simulation studies showed that the Q-matrix recovery rate was satisfactory. Using the fraction-subtraction data, an empirical study was conducted for the DINA model and rRUM. The estimated Q-matrices from the two models were compared with the expert-designed Q-matrix.
Quantitative psychology and psychometrics, Statistics, Educational tests and measurements
Measurement and Evaluation, Human Development
Dissertations
Increasing Access to College-Level Math: Early Outcomes Using the Virginia Placement Test
http://academiccommons.columbia.edu/catalog/ac:175149
Rodríguez, Olga
http://dx.doi.org/10.7916/D8HQ3X1P
Fri, 27 Jun 2014 12:03:42 +0000
In spring 2012, the Virginia Community College System introduced a new math placement test, known as the Virginia Placement Test–Math (VPT). The system also implemented a new placement policy, with different math competencies required for the entry-level college math courses in liberal arts and STEM programs. This brief examines differences in students’ college math enrollment and completion rates before and after the introduction of the VPT and the new placement policy. After the VPT was implemented, a greater proportion of students placed into and enrolled in college-level math courses, and these higher enrollments boosted course completion rates for the cohort as a whole. However, pass rates among students who enrolled in entry-level math courses declined modestly. These findings highlight a tradeoff that should be acknowledged when planning reforms to reduce remedial placement rates using a placement instrument. Changes to how academic supports are deployed and changes to teaching and learning strategies used in college math courses could improve conditional pass rates over time.
Community college education, Educational tests and measurements, Education policy
or2125
Institute on Education and the Economy, Community College Research Center
Reports
Application of ordered latent class regression model in educational assessment
http://academiccommons.columbia.edu/catalog/ac:161911
Cha, Jisung
http://hdl.handle.net/10022/AC:P:20599
Thu, 06 Jun 2013 15:11:28 +0000
Latent class analysis is a useful tool to deal with discrete multivariate response data. Croon (1990) proposed the ordered latent class model where latent classes are ordered by imposing inequality constraints on the cumulative conditional response probabilities. Taking stochastic ordering of latent classes into account in the analysis of data gives a meaningful interpretation, since the primary purpose of a test is to order students on the latent trait continuum. This study extends Croon's model to ordered latent class regression that regresses latent class membership on covariates (e.g., gender, country) and demonstrates the utility of an ordered latent class regression model in educational assessment using data from the Trends in International Mathematics and Science Study (TIMSS). The benefit of this model is that item analysis and group comparisons can be done simultaneously in one model. The model is fitted by maximum likelihood estimation with an EM algorithm. It is found that the proposed model is a useful tool for exploratory purposes as a special case of nonparametric item response models, and that cross-country differences can be modeled as different compositions of discrete classes. Simulations are conducted to evaluate the performance of information criteria (AIC and BIC) in selecting the appropriate number of latent classes in the model. From the simulation results, AIC outperforms BIC for the model with the order-restricted maximum likelihood estimator.
Educational tests and measurements, Statistics, Mathematics education
jc2320
Measurement and Evaluation, Human Development
Dissertations
Dealing with Sparse Rater Scoring of Constructed Responses within a Framework of a Latent Class Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:161491
Kim, Sunhee
http://hdl.handle.net/10022/AC:P:20440
Thu, 23 May 2013 13:08:18 +0000
In many assessment situations that use a constructed-response (CR) item, an examinee's response is evaluated by only one rater, which is called a single rater design. For example, in a classroom assessment practice, only one teacher grades each student's performance. While single rater designs are the most cost-effective method among all rater designs, the lack of a second rater causes difficulties with respect to how the scores should be used and evaluated. For example, one cannot assess rater reliability or rater effects when there is only one rater. The present study explores possible solutions for the issues that arise in sparse rater designs within the context of a latent class version of signal detection theory (LC-SDT) that has been previously used for rater scoring. This approach provides a model for rater cognition in CR scoring (DeCarlo, 2005; 2008; 2010) and offers measures of rater reliability and various rater effects. The following potential solutions to rater sparseness were examined: 1) the use of parameter restrictions to yield an identified model, 2) the use of informative priors in a Bayesian approach, and 3) the use of back readings (e.g., partially available 2nd rater observations), which are available in some large scale assessments. Simulations and analyses of real-world data are conducted to examine the performance of these approaches. Simulation results showed that using parameter constraints allows one to detect various rater effects that are of concern in practice. The Bayesian approach also gave useful results, although estimation of some of the parameters was poor and the standard deviations of the parameter posteriors were large, except when the sample size was large. Using back-reading scores gave an identified model and simulations showed that the results were generally acceptable, in terms of parameter estimation, except for small sample sizes.
The paper also examines the utility of the approaches as applicable to the PIRLS USA reliability data. The results show some similarities and differences between parameter estimates obtained with posterior mode estimation and with Bayesian estimation. Sensitivity analyses revealed that rater parameter estimates are sensitive to the specification of the priors, as also found in the simulation results with smaller sample sizes.
Educational tests and measurements
shk2125
Measurement and Evaluation, Human Development
Dissertations
An Item Response Theory Approach to Causal Inference in the Presence of a Pre-intervention Assessment
http://academiccommons.columbia.edu/catalog/ac:188469
Marini, Jessica
http://dx.doi.org/10.7916/D8WM1CR3
Thu, 16 May 2013 16:18:35 +0000
This research develops a form of causal inference based on Item Response Theory (IRT) to combat bias that occurs when existing causal inference methods are used under certain scenarios. When a pre-test is administered prior to a treatment decision, bias can occur in causal inferences about the decision's effect on the outcome. This new IRT-based method uses item-level information, treatment placement, and the outcome to produce estimates of each subject's ability in the chosen domain. Examining a causal inference research question in an IRT model-based framework becomes a model-based way to match subjects on estimates of their true ability. This model-based matching allows inferences to be made about a subject's performance as if they had been in the opposite treatment group. The IRT method is developed to address shortcomings of existing methods, such as relying on conditional independence between pre-test scores and outcomes. Using simulation, the IRT method is compared to existing methods under two different model scenarios in terms of Type I and Type II errors. Then the method's parameter recovery is analyzed, followed by the accuracy of treatment effect evaluation. The IRT method is shown to outperform existing methods in an ability-based scenario. Finally, the IRT method is applied to real data assessing the impact of advanced STEM in high school on a student's choice of major, and compared to existing alternative approaches.
Educational tests and measurements, Statistics
jpm2120
Measurement and Evaluation, Human Development
Dissertations
Examining the Impact of Examinee-Selected Constructed Response Items in the Context of a Hierarchical Rater Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:186227
Patterson, Brian Francis
http://dx.doi.org/10.7916/D8X929DC
Tue, 14 May 2013 15:36:13 +0000
Research into the relatively rarely used examinee-selected item assessment designs has revealed certain challenges. This study aims to more comprehensively re-examine the key issues around examinee-selected items under a modern model for constructed-response scoring. Specifically, data were simulated under the hierarchical rater model with signal detection theory rater components (HRM-SDT; DeCarlo, Kim, and Johnson, 2011) and a variety of examinee-item selection mechanisms were considered. These conditions varied from the hypothetical baseline condition (where examinees choose randomly and with equal frequency from a pair of item prompts) to the perhaps more realistic and certainly more troublesome condition where examinees select items based on the very subject-area proficiency that the instrument intends to measure. While good examinee, item, and rater parameter recovery was apparent in the former condition for the HRM-SDT, serious issues with item and rater parameter estimation were apparent in the latter. Additional conditions were considered, as well as competing psychometric models for the estimation of examinee proficiency. Finally, practical implications of using examinee-selected item designs are given, as well as future directions for research.
Educational tests and measurements
bfp2103
Measurement and Evaluation, Human Development
Dissertations
Statistical Inference for Diagnostic Classification Models
http://academiccommons.columbia.edu/catalog/ac:160464
Xu, Gongjun
http://hdl.handle.net/10022/AC:P:20058
Tue, 30 Apr 2013 16:06:11 +0000
Diagnostic classification models (DCM) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he/she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers to the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be further extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, this problem can be formulated as a pure significance testing problem with null hypothesis H0: Q = Q0, where Q0 is the candidate Q-matrix.
We propose a test statistic that measures the consistency of observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied. In addition, we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.
Statistics, Educational tests and measurements
gx2108
Statistics
Dissertations
Improving Developmental Education Assessment and Placement: Lessons From Community Colleges Across the Country
http://academiccommons.columbia.edu/catalog/ac:157295
Hodara, Michelle; Jaggars, Shanna; Karp, Melinda Jane Mechur
http://hdl.handle.net/10022/AC:P:19262
Wed, 06 Mar 2013 10:27:08 +0000
At open-access two-year public colleges, the goal of the traditional assessment and placement process is to match incoming students to the developmental or college-level courses for which they have adequate preparation; the process presumably increases underprepared students’ chances of short- and long-term success in college while maintaining the academic quality and rigor of college-level courses. However, the traditional process may be limited in its ability to achieve these aims due to poor course placement accuracy and inconsistent standards of college readiness. To understand current approaches that seek to improve the process, we conducted a scan of assessment and placement policies and practices at open-access two-year colleges in Georgia, New Jersey, North Carolina, Oregon, Texas, Virginia, and Wisconsin. We describe the variety of approaches that systems and colleges employed to ameliorate poor course placement accuracy and inconsistent standards associated with the traditional process. Taking a broad view of the extent of these approaches, we find that most colleges we studied adopted a measured approach that addressed a single limitation without attending to other limitations that contribute to the same overall problem of poor course placement accuracy or inconsistent standards. Much less common were comprehensive approaches that attended to multiple limitations of the process; these approaches were likely to result from changes to developmental education as a whole.
Drawing from the study’s findings, we also discuss how colleges can overcome barriers to reform in order to implement approaches that hold promise for improved course placement accuracy, more consistent standards of college readiness, and, potentially, greater long-term academic success of community college students.Community college education, Educational tests and measurementsmeh70, sj2391, mjm305Institute on Education and the Economy, Economics and Education, Community College Research CenterWorking papersAssessing Developmental Assessment in Community Colleges
http://academiccommons.columbia.edu/catalog/ac:146946
Hughes, Katherine Lee; Scott-Clayton, Judith E.http://hdl.handle.net/10022/AC:P:13231Thu, 17 May 2012 15:34:34 +0000Placement exams are high-stakes assessments that determine many students' college trajectories. The majority of community colleges use placement exams—most often the ACCUPLACER, developed by the College Board, or the COMPASS, developed by ACT, Inc.—to sort students into college-level or developmental education courses in math, reading, and sometimes writing. More than half of entering students at community colleges are placed into developmental education in at least one subject as a result. But the evidence on the predictive validity of these tests is not as strong as many might assume, given the stakes involved—and recent research fails to find evidence that the resulting placements into remediation improve student outcomes. While this has spurred debate about the content and delivery of remedial coursework, it is possible that the assessment process itself may be broken; the debate about remediation policy is incomplete without a fuller understanding of the role of assessment. This Brief examines the role of developmental assessment, the validity of the most common assessments currently in use, and emerging directions in assessment policy and practice. Alternative methods of assessment—particularly those involving multiple measures of student preparedness—seem to have the potential to improve student outcomes, but more research is needed to determine what type of change in assessment and placement policy might improve persistence and graduation rates. The Brief concludes with a discussion of implications for policy and research.Community college education, Educational tests and measurementskh154, js3676Institute on Education and the Economy, International and Transcultural Studies, Community College Research CenterReportsOn the Use of Covariates in a Latent Class Signal Detection Model, with Applications to Constructed Response Scoring
http://academiccommons.columbia.edu/catalog/ac:146692
Wang, Zijian Geraldhttp://hdl.handle.net/10022/AC:P:13156Mon, 07 May 2012 10:52:58 +0000A latent class signal detection (SDT) model was recently introduced as an alternative to traditional item response theory (IRT) methods in the analysis of constructed response data. This class of models can be represented as restricted latent class models and differs from the IRT approach in the way the latent construct is conceptualized. One appeal of the signal detection approach is that it provides an intuitive framework from which psychological processes governing rater behavior can be better understood. The present study developed an extension of the latent class SDT model to include covariates and examined the performance of the resulting model. Covariates can be incorporated into the latent class SDT model in three ways: by affecting (1) latent class membership, (2) conditional response probabilities, or (3) both latent class membership and conditional response probabilities. In each case, simulations were conducted to investigate both parameter recovery and classification accuracy of the extended model under two competing rater designs; in addition, implications of ignoring covariate effects and covariate misspecification were explored. The ability of information criteria, namely the AIC, small sample adjusted AIC and BIC, to recover the true model with respect to how covariates are introduced was also examined. Results indicate that parameters were generally well recovered in fully-crossed designs; to obtain similar levels of estimation precision in incomplete designs, sample size requirements were comparatively higher and depended on the number of indicators used. When covariate effects were not accounted for or were misspecified, results show that parameter estimates tended to be severely biased, which in turn reduced classification accuracy. With respect to model recovery, the BIC performed the most consistently amongst the information criteria considered. 
In light of these findings, recommendations were made with regard to sample size requirements and model building strategies when implementing the extended latent class SDT model.Educational tests and measurementszgw2Measurement and Evaluation, Human DevelopmentDissertationsAssessing Developmental Assessment in Community Colleges
http://academiccommons.columbia.edu/catalog/ac:146649
Hughes, Katherine Lee; Scott-Clayton, Judith E.http://hdl.handle.net/10022/AC:P:13143Fri, 04 May 2012 15:45:14 +0000Placement exams are high-stakes assessments that determine many students' college trajectories. The majority of community colleges use placement exams—most often the ACCUPLACER, developed by the College Board, or the COMPASS, developed by ACT, Inc.—to sort students into college-level or developmental education courses in math, reading, and sometimes writing. More than half of entering students at community colleges are placed into developmental education in at least one subject as a result. But the evidence on the predictive validity of these tests is not as strong as many might assume, given the stakes involved—and recent research fails to find evidence that the resulting placements into remediation improve student outcomes. While this has spurred debate about the content and delivery of remedial coursework, it is possible that the assessment process itself may be broken; the debate about remediation policy is incomplete without a fuller understanding of the role of assessment. This paper examines the extent of consensus regarding the role of developmental assessment and how it is best implemented, the validity of the most common assessments currently in use, and emerging directions in assessment policy and practice. Alternative methods of assessment—particularly those involving multiple measures of student preparedness—seem to have the potential to improve student outcomes, but more research is needed to determine what type of change in assessment and placement policy might improve persistence and graduation rates. 
The paper concludes with a discussion of gaps in the literature and implications for policy and research.Community college education, Educational tests and measurementskh154, js3676Institute on Education and the Economy, International and Transcultural Studies, Community College Research CenterWorking papersPredicting Success in College: The Importance of Placement Tests and High School Transcripts
http://academiccommons.columbia.edu/catalog/ac:146486
Belfield, Clive; Crosta, Peter Michaelhttp://hdl.handle.net/10022/AC:P:13086Wed, 02 May 2012 12:44:27 +0000This paper uses student-level data from a statewide community college system to examine the validity of placement tests and high school information in predicting course grades and college performance. It considers the ACCUPLACER and COMPASS placement tests, using two quantitative and two literacy tests from each battery. The authors find that placement tests do not yield strong predictions of how students will perform in college. Placement test scores are positively but weakly associated with college grade point average (GPA). The correlation disappears when high school GPA is controlled for. Placement test scores are positively associated with college credit accumulation even after controlling for high school GPA. After three to five semesters, a student with a placement test score in the highest quartile has on average nine credits more than a student with a placement test score in the lowest quartile. In contrast, high school GPAs are useful for predicting many aspects of students' college performance. High school GPA has a strong association with college GPA; students' college GPAs are approximately 0.6 units below their high school GPAs. High school GPA also has a strong association with college credit accumulation. A student whose high school GPA is one grade higher will accumulate approximately four extra credits per semester. Other information from high school transcripts is modestly useful; this includes number of math and English courses taken in high school, honors courses, number of F grades, and number of credits. This high school information is not independently useful beyond high school GPA, and collectively it explains less variation in college performance. The authors also calculate accuracy rates and four validity metrics for placement tests. They find high "severe" error rates using the placement test cutoffs. 
The severe error rate for English is 27 to 33 percent; i.e., three out of every ten students are severely misassigned. For math, the severe error rates are lower but still nontrivial. Using high school GPA instead of placement tests reduces the severe error rates by half across both English and math.Community college education, Educational tests and measurementscb2505, pmc2107National Center for the Study of Privatization in Education, Institute on Education and the Economy, Economics and Education, Community College Research CenterWorking papersDo High-Stakes Placement Exams Predict College Success?
http://academiccommons.columbia.edu/catalog/ac:146482
Scott-Clayton, Judith E.http://hdl.handle.net/10022/AC:P:13085Wed, 02 May 2012 12:34:38 +0000Community colleges are typically assumed to be nonselective, open-access institutions. Yet access to college-level courses at such institutions is far from guaranteed: the vast majority of two-year institutions administer high-stakes exams to entering students that determine their placement into either college-level or remedial education. Despite the stakes involved, there has been relatively little research investigating whether such exams are valid for their intended purpose, or whether other measures of preparedness might be equally or even more effective. This paper contributes to the literature by analyzing the predictive validity of one of the most commonly used assessments, using data on over 42,000 first-time entrants to a large, urban community college system. Using both traditional correlation coefficients and more useful decision-theoretic measures of placement accuracy and error rates, I find that placement exams are more predictive of success in math than in English, and more predictive of who is likely to do well in college-level coursework than of who is likely to fail. Utilizing multiple measures to make placement decisions could reduce severe misplacements by about 15 percent without changing the remediation rate, or could reduce the remediation rate by 8 to 12 percentage points while maintaining or increasing success rates in college-level courses. Implications and limitations are discussed.Community college education, Educational tests and measurementsjs3676International and Transcultural Studies, Community College Research CenterWorking papersRater Drift in Constructed Response Scoring via Latent Class Signal Detection Theory and Item Response Theory
http://academiccommons.columbia.edu/catalog/ac:132272
Park, Yoon Soohttp://hdl.handle.net/10022/AC:P:10394Tue, 17 May 2011 15:29:48 +0000The use of constructed response (CR) items or performance tasks to assess test takers' ability has grown tremendously over the past decade. Examples of CR items in psychological and educational measurement range from essays and works of art to admissions interviews. However, unlike multiple-choice (MC) items that have predetermined options, CR items require test takers to construct their own answer. As such, they require the judgment of multiple raters who are subject to differences in perception and prior knowledge of the material being evaluated. As with any scoring procedure, the scores assigned by raters must be comparable over time and over different test administrations and forms; in other words, scores must be reliable and valid for all test takers, regardless of when an individual takes the test. This study examines how longitudinal patterns or changes in rater behavior affect model-based classification accuracy. Rater drift refers to changes in rater behavior across different test administrations. Prior research has found evidence of drift. Rater behavior in CR scoring is examined using two measurement models: latent class signal detection theory (SDT) and item response theory (IRT) models. Rater effects (e.g., leniency and strictness) are partly examined with simulations, where the ability of different models to capture changes in rater behavior is studied. Drift is also examined in two real-world large-scale tests: a teacher certification test and a high school writing test. These tests use the same set of raters for long periods of time, where each rater's scoring is examined on a monthly basis. Results from the empirical analysis showed that rater models were effective in detecting changes in rater behavior across test administrations in real-world data. However, there were differences in rater discrimination between the latent class SDT and IRT models. 
Simulations were used to examine the effect of rater drift on classification accuracy and on differences between the latent class SDT and IRT models. Changes in rater severity had only a minimal effect on classification. Rater discrimination had a greater effect on classification accuracy. This study also found that IRT models detected changes in rater severity and in rater discrimination even when data were generated from the latent class SDT model. However, when data were non-normal, IRT models underestimated rater discrimination, which may lead to incorrect inferences on the precision of raters. These findings provide new and important insights into CR scoring and issues that emerge in practice, including methods to improve rater training.Quantitative psychology and psychometrics, Educational tests and measurements, Statisticsysp2102Measurement and Evaluation, National Center for Disaster Preparedness, Human DevelopmentDissertationsThe central role of noise in evaluating interventions that use test scores to rank schools
http://academiccommons.columbia.edu/catalog/ac:115957
Chay, Kenneth; McEwan, Patrick J.; Urquiola, Miguel S.http://hdl.handle.net/10022/AC:P:493Thu, 24 Mar 2011 11:14:53 +0000Several countries have implemented programs that use test scores to rank schools, and to reward or penalize them based on their students' average performance. Recently, Kane and Staiger (2002) have warned that imprecision in the measurement of school-level test scores could impede these efforts. There is little evidence, however, on how seriously noise hinders the evaluation of the impact of these interventions. We examine these issues in the context of Chile's P-900 program, a country-wide intervention in which resources were allocated based on cutoffs in schools' mean test scores. We show that transitory noise in average scores and mean reversion lead conventional estimation approaches to greatly overstate the impacts of such programs. We then show how a regression discontinuity design that utilizes the discrete nature of the selection rule can be used to control for reversion biases. While the RD analysis provides convincing evidence that the P-900 program had significant effects on test score gains, these effects are much smaller than is widely believed.Educational tests and measurementsmsu2101Economics, International and Public AffairsWorking papers