Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bsubject_facet%5D%5B%5D=Statistics&q=&rows=500&sort=record_creation_date+desc
en-us

Learning Structure in Time Series for Neuroscience and Beyond
http://academiccommons.columbia.edu/catalog/ac:180952
Pfau, David Benjamin
http://dx.doi.org/10.7916/D8WH2NRR
Thu, 04 Dec 2014 00:00:00 +0000
Advances in neuroscience are producing data at an astounding rate - data which are fiendishly complex both to process and to interpret. Biological neural networks are high-dimensional, nonlinear, noisy, heterogeneous, and in nearly every way defy the simplifying assumptions of standard statistical methods. In this dissertation we address a number of issues with understanding the structure of neural populations, from the abstract level of how to uncover structure in generic time series, to the practical matter of finding relevant biological structure in state-of-the-art experimental techniques. To learn the structure of generic time series, we develop a new statistical model, which we dub the probabilistic deterministic infinite automata (PDIA), which uses tools from nonparametric Bayesian inference to learn a very general class of sequence models. We show that the models learned by the PDIA often offer better predictive performance and faster inference than Hidden Markov Models, while being significantly more compact than models that simply memorize contexts. For large populations of neurons, models like the PDIA become unwieldy, and we instead investigate ways to robustly reduce the dimensionality of the data. In particular, we adapt the generalized linear model (GLM) framework for regression to the case of matrix completion, which we call the low-dimensional GLM. We show that subspaces and dynamics of neural activity can be accurately recovered from model data, and that, even with only minimal assumptions about the structure of the dynamics, the approach can still lead to good predictive performance on real data. Finally, to bridge the gap between recording technology and analysis, particularly as recordings from ever-larger populations of neurons become the norm, automated methods for extracting activity from raw recordings become a necessity.
We present a number of methods for automatically segmenting biological units from optical imaging data, with applications to light sheet recording of genetically encoded calcium indicator fluorescence in the larval zebrafish, and optical electrophysiology using genetically encoded voltage indicators in culture. Together, these methods are a powerful set of tools for addressing the diverse challenges of modern neuroscience.
Neurosciences, Statistics
dbp2112
Neurobiology and Behavior
Dissertations

Preaching to the Unconverted
http://academiccommons.columbia.edu/catalog/ac:179470
Uriarte, Maria; Yackulic, Charles B.
http://dx.doi.org/10.7916/D8SB44FM
Sun, 09 Nov 2014 00:00:00 +0000
Rapid advances in computing in the past 20 years have led to an explosion in the development and adoption of new statistical modeling tools (Gelman and Hill 2006, Clark 2007, Bolker 2008, Cressie et al. 2009). These innovations have occurred in parallel with a tremendous increase in the availability of ecological data. The latter has been fueled both by new tools that have facilitated data collection and management efforts (e.g., remote sensing, database management software, and so on) and by increased ease of data sharing through computers and the World Wide Web. The impending implementation of the National Ecological Observatory Network (NEON) will further boost data availability. These rapid advances in the ability of ecologists to collect data have not been matched by application of modern statistical tools. Given the critical questions ecology is facing (e.g., climate change, species extinctions, spread of invasives, irreversible losses of ecosystem services) and the benefits that can be gained from connecting existing data to models in a sophisticated inferential framework (Clark et al. 2001, Pielke and Connant 2003), it is important to understand why this mismatch exists. Such an understanding would point to the issues that must be addressed if ecologists are to make useful inferences from these new data and tools and contribute in substantial ways to management and decision making.
Ecology, Statistics
mu2126
Ecology, Evolution, and Environmental Biology
Articles

SPAr package for Fan and Lo (2013) "A robust model-free approach for rare variants association studies incorporating gene-gene and gene-environmental interactions."
http://academiccommons.columbia.edu/catalog/ac:179424
Fan, Ruixue; Lo, Shaw-Hwa
http://dx.doi.org/10.7916/D84Q7SN6
Fri, 07 Nov 2014 00:00:00 +0000
A growing body of evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Second, it does not sacrifice the marginal detection power; when rare variants have only marginal effects, it is comparable with the most competitive method in the current literature. Third, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions. This package is also maintained on the Comprehensive R Archive Network (http://cran.r-project.org).
It contains the R programs, user's manual and example code.
Genetics, Statistics
rf2283, shl5
Statistics
Computer software

Source codes for GLMLE algorithm
http://academiccommons.columbia.edu/catalog/ac:178966
He, Ran
http://dx.doi.org/10.7916/D8HH6HQR
Fri, 24 Oct 2014 00:00:00 +0000
This is the R source code for the algorithm proposed for fitting exponential random graph models (ERGMs) on large social networks in our paper "Estimation of exponential random graph models for large social networks via graph limits". Specifically, the ERGM we implement is the one that considers homomorphism densities of edges, two-stars and triangles, as examined in the above paper.
Statistics, Computer science
rh2528
Statistics
Computer software

Markov Clustering on Person-to-Person Similarity Graph: Attribution of Movies’ Box Office Results to Preferences of Viewer Communities
http://academiccommons.columbia.edu/catalog/ac:177703
Tkachenko, Yegor
http://dx.doi.org/10.7916/D87M06G5
Mon, 29 Sep 2014 00:00:00 +0000
The search for methods of deriving actionable marketing segmentation has a long history in the marketing literature. This work proposes the use of the Markov clustering algorithm on a person-to-person similarity graph, where similarity between individuals is based on their similarity in rating assignments. This allows the detection of taste-based communities of users. Simple regression analysis is subsequently applied to detect the dependencies of box office results of movies of various genres on the preferences of specific viewer communities. The resulting analysis permitted identification of communities that drive box office results of specific movie genres.
Business, Marketing, Statistics
it2206
Business
Master's theses

ExpressionPlot: a web-based framework for analysis of RNA-Seq and microarray gene expression data
http://academiccommons.columbia.edu/catalog/ac:183139
Friedman, Brad; Maniatis, Tom
http://dx.doi.org/10.7916/D82J6979
Mon, 08 Sep 2014 00:00:00 +0000
RNA-Seq and microarray platforms have emerged as important tools for detecting changes in gene expression and RNA processing in biological samples. We present ExpressionPlot, a software package consisting of a default back end, which prepares raw sequencing or Affymetrix microarray data, and a web-based front end, which offers a biologically centered interface to browse, visualize, and compare different data sets. Download and installation instructions, a user's manual, discussion group, and a prototype are available at http://expressionplot.com
Statistics, Bioinformatics
tm2472
Biochemistry and Molecular Biophysics
Articles

Copy number variation genotyping using family information
http://academiccommons.columbia.edu/catalog/ac:180080
Chu, Jen-hwa; Rogers, Angela; Ionita-Laza, Iuliana; Darvishi, Katayoon; Mills, Ryan E.; Lee, Charles; Raby, Benjamin A.
http://dx.doi.org/10.7916/D8HD7T0D
Mon, 08 Sep 2014 00:00:00 +0000
Background: In recent years there has been a growing interest in the role of copy number variations (CNV) in genetic diseases. Though there has been rapid development of technologies and statistical methods devoted to the detection of CNVs from array data, the inherent data quality challenges associated with most hybridization techniques remain a problem in CNV association studies. Results: To help address these data quality issues in the context of family-based association studies, we introduce a statistical framework for intensity-based array data that takes into account the family information for copy-number assignment. The method is an adaptation of traditional methods for modeling SNP genotype data that assume a Gaussian mixture model, whereby CNV calling is performed for all family members simultaneously, leveraging within-family data to reduce CNV calls that are incompatible with Mendelian inheritance while still allowing de-novo CNVs. Applying this method to simulation studies and a genome-wide association study in asthma, we find that our approach significantly improves the accuracy of CNV calls and reduces the Mendelian inconsistency rates and false positive genotype calls. The results were validated using qPCR experiments. Conclusions: We have demonstrated that the use of family information can improve the quality of CNV calling and may yield more powerful association tests of CNVs.
Genetics, Statistics
ii2135
Mailman School of Public Health, Biostatistics
Articles

Reporting of analyses from randomized controlled trials with multiple arms: a systematic review
http://academiccommons.columbia.edu/catalog/ac:180137
Baron, Gabriel; Perrodeau, Elodie; Boutron, Isabelle; Ravaud, Philippe
http://dx.doi.org/10.7916/D837772T
Mon, 08 Sep 2014 00:00:00 +0000
Background: Multiple-arm randomized trials can be more complex in their design, data analysis, and result reporting than two-arm trials. We conducted a systematic review to assess the reporting of analyses in reports of randomized controlled trials (RCTs) with multiple arms. Methods: The literature in the MEDLINE database was searched for reports of RCTs with multiple arms published in 2009 in the core clinical journals. Two reviewers extracted data using a standardized extraction form. Results: In total, 298 reports were identified. Descriptions of the baseline characteristics and outcomes per group were missing in 45 reports (15.1%) and 48 reports (16.1%), respectively. More than half of the articles (n = 171, 57.4%) reported that a planned global test comparison was used (that is, assessment of the global differences between all groups), but 67 (39.2%) of these 171 articles did not report details of the planned analysis. Of the 116 articles reporting a global comparison test, 12 (10.3%) did not report the analysis as planned. In all, 60% of publications (n = 180) described planned pairwise test comparisons (that is, assessment of the difference between two groups), but 20 of these 180 articles (11.1%) did not report the pairwise test comparisons. Of the 204 articles reporting pairwise test comparisons, the comparisons were not planned for 44 (21.6%) of them. Less than half the reports (n = 137; 46%) provided baseline and outcome data per arm and reported the analysis as planned. Conclusions: Our findings highlight discrepancies between the planning and reporting of analyses in reports of multiple-arm trials.
Statistics, Health sciences
pr2341
Epidemiology
Articles

Helping the decision maker effectively promote various experts’ views into various optimal solutions to China’s institutional problem of health care provider selection.
http://academiccommons.columbia.edu/catalog/ac:180132
Tang, Liyang
http://dx.doi.org/10.7916/D8BP0147
Mon, 08 Sep 2014 00:00:00 +0000
Background: The main aim of China’s Health Care System Reform was to help the decision maker find the optimal solution to China’s institutional problem of health care provider selection. A pilot health care provider research system was recently organized in China’s health care system, and it could efficiently collect from various experts the data for determining the optimal solution to China’s institutional problem of health care provider selection. The purpose of this study was to apply the optimal implementation methodology to help the decision maker effectively promote various experts’ views into various optimal solutions to this problem with the support of this pilot system. Methods: After the general framework of China’s institutional problem of health care provider selection was established, this study collaborated with the National Bureau of Statistics of China to commission a large-scale 2009 to 2010 national expert survey (n = 3,914) through the organization of a pilot health care provider research system for the first time in China, and the analytic network process (ANP) implementation methodology was adopted to analyze the dataset from this survey.
Results: The market-oriented health care provider approach was the optimal solution to China’s institutional problem of health care provider selection from the doctors’ point of view; the traditional government’s regulation-oriented health care provider approach was the optimal solution to China’s institutional problem of health care provider selection from the pharmacists’ point of view, the hospital administrators’ point of view, and the point of view of health officials in health administration departments; the public private partnership (PPP) approach was the optimal solution to China’s institutional problem of health care provider selection from the nurses’ point of view, the point of view of officials in medical insurance agencies, and the health care researchers’ point of view. Conclusions: The data collected through a pilot health care provider research system in the 2009 to 2010 national expert survey could help the decision maker effectively promote various experts’ views into various optimal solutions to China’s institutional problem of health care provider selection.
Statistics, Business
Business
Articles

Hydroclimatology of Extreme Precipitation and Floods Originating from the North Atlantic Ocean
http://academiccommons.columbia.edu/catalog/ac:177151
Nakamura, Jennifer Anne
http://dx.doi.org/10.7916/D86H4FM1
Fri, 15 Aug 2014 00:00:00 +0000
This study explores seasonal patterns and structures of moisture transport pathways from the North Atlantic Ocean and the Gulf of Mexico that lead to extreme large-scale precipitation and floods over land. Storm tracks, such as the tropical cyclone tracks in the Northern Atlantic Ocean, are an example of moisture transport pathways. In the first part, North Atlantic cyclone tracks are clustered by the moments to identify common traits in genesis locations, track shapes, intensities, life spans, landfalls, seasonal patterns, and trends. The clustering results of part one show the dynamical behavior differences of tropical cyclones born in different parts of the basin. Drawing on these conclusions, in the second part, a statistical track segment model is developed for simulation of tracks to improve the reliability of tropical cyclone risk probabilities. Moisture transport pathways from the North Atlantic Ocean are also explored through the specific regional flood dynamics of the U.S. Midwest and the United Kingdom in part three of the dissertation. Part I. Classifying North Atlantic Tropical Cyclones Tracks by Mass Moments. A new method for classifying tropical cyclones or similar features is introduced. The cyclone track is considered as an open spatial curve, with the wind speed or power information along the curve considered as a mass attribute. The first and second moments of the resulting object are computed and then used to classify the historical tracks using standard clustering algorithms. Mass moments allow the whole track shape, length and location to be incorporated into the clustering methodology. Tropical cyclones in the North Atlantic basin are clustered with K-means by mass moments producing an optimum of six clusters with differing genesis locations, track shapes, intensities, life spans, landfalls, seasonality, and trends.
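The mass-moment idea in Part I can be illustrated with a minimal sketch. It assumes a track is discretized into planar (x, y) points with one nonnegative weight per point (for example, wind speed); the function names, the point-mass approximation of the curve, and the flat-plane geometry are all assumptions made for illustration and are not taken from the dissertation.

```python
def mass_moments(points, weights):
    """Return the weighted centroid and second central moments of a track.

    points  -- list of (x, y) positions along the track (hypothetical planar units)
    weights -- one nonnegative "mass" per point, e.g. wind speed (assumption)
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # First moment: mass-weighted mean position.
    cx = sum(w * x for (x, _), w in zip(points, weights)) / total
    cy = sum(w * y for (_, y), w in zip(points, weights)) / total
    # Second central moments: mass-weighted variances and covariance.
    vxx = sum(w * (x - cx) ** 2 for (x, _), w in zip(points, weights)) / total
    vyy = sum(w * (y - cy) ** 2 for (_, y), w in zip(points, weights)) / total
    vxy = sum(w * (x - cx) * (y - cy) for (x, y), w in zip(points, weights)) / total
    return (cx, cy), (vxx, vxy, vyy)


def feature_vector(points, weights):
    """Flatten the moments into one row suitable for a clustering algorithm."""
    (cx, cy), (vxx, vxy, vyy) = mass_moments(points, weights)
    return [cx, cy, vxx, vxy, vyy]


# Toy track: three points on a straight line, heaviest in the middle.
example = feature_vector([(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)], [1.0, 2.0, 1.0])
```

One such feature vector per historical track could then be passed to any standard K-means implementation; the choice of five features here is only one plausible encoding.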
Even variables that are not directly clustered show distinct separation between clusters. A trend analysis confirms recent conclusions of increasing tropical cyclones in the basin over the past two decades. However, the trends vary across clusters. Part II: Tropical cyclone Intensity and Track Simulator (HITS) with Atlantic Ocean Applications for Risk Assessment. A nonparametric stochastic model is developed and tested for the simulation of tropical cyclone tracks. Tropical cyclone tracks demonstrate continuity and memory over many time and space steps. Clusters of tracks can be coherent, and the separation between clusters may be marked by geographical locations where groups of tracks diverge due to the physics of the underlying process. Consequently, their evolution may be non-Markovian. Markovian simulation models, as often used, may produce tracks that potentially diverge or lose memory more quickly than in nature. This is addressed here through a model that simulates tracks by randomly sampling track segments of varying length, selected from historical tracks. For performance evaluation, a spatial grid is imposed on the domain of interest. For each grid box, long-term tropical cyclone risk is assessed through the annual probability distributions of the number of storm hours, landfalls, winds, and other statistics. Total storm length is determined at birth by local distribution, and movement to other tropical cyclone segments by distance to neighbor tracks, comparative vector, and age of track. An assessment of the performance for tropical cyclone track simulation and potential directions for the improvement and use of such a model are discussed. Part III: Dynamical Structure of Extreme Floods in the U.S. Midwest and the United Kingdom. Twenty extreme spring floods that occurred in the Ohio Basin between 1901 and 2008, identified from daily river discharge data, are investigated and compared to the April 2011 Ohio River flood event.
Composites of synoptic fields for the flood events show that all these floods are associated with a similar pattern of sustained advection of low-level moisture and warm air from the tropical Atlantic Ocean and the Gulf of Mexico. The typical flow conditions are governed by an anomalous semi-stationary ridge situated east of the US East Coast, which steers the moisture and converges it into the Ohio Valley. Significantly, the moisture path common to all the 20 cases studied here as well as the case of April 2011 is distinctly different from the normal path of Atlantic moisture during spring, which occurs further west. It is shown further that the Ohio basin moisture convergence responsible for the floods is caused primarily by the atmospheric circulation anomaly advecting the climatological mean moisture field. Transport and related convergence due to the covariance between moisture anomalies and circulation anomalies are of secondary but non-negligible importance. The importance of atmospheric circulation anomalies to floods is confirmed by conducting a similar analysis for a series of winter floods on the River Eden in northwest England.
Atmospheric sciences, Hydrologic sciences, Statistics
jam148
Earth and Environmental Engineering
Dissertations

Convex Optimization Algorithms and Recovery Theories for Sparse Models in Machine Learning
http://academiccommons.columbia.edu/catalog/ac:175385
Huang, Bo
http://dx.doi.org/10.7916/D8VM49DM
Mon, 07 Jul 2014 00:00:00 +0000
Sparse modeling is a rapidly developing topic that arises frequently in areas such as machine learning, data analysis and signal processing. One important application of sparse modeling is the recovery of a high-dimensional object from a relatively small number of noisy observations, which is the main focus of compressed sensing, matrix completion (MC) and robust principal component analysis (RPCA). However, the power of sparse models is hampered by the unprecedented size of the data that has become more and more available in practice. Therefore, it has become increasingly important to better harness convex optimization techniques to take advantage of any underlying "sparsity" structure in problems of extremely large size. This thesis focuses on two main aspects of sparse modeling. From the modeling perspective, it extends convex programming formulations for matrix completion and robust principal component analysis problems to the case of tensors, and derives theoretical guarantees for exact tensor recovery under a framework of strongly convex programming. On the optimization side, an efficient first-order algorithm with the optimal convergence rate has been proposed and studied for a wide range of linearly constrained sparse modeling problems.
Mathematics, Statistics, Operations research
Industrial Engineering and Operations Research
Dissertations

Estimating the Q-matrix for Cognitive Diagnosis Models in a Bayesian Framework
http://academiccommons.columbia.edu/catalog/ac:176107
Chung, Meng-ta
http://dx.doi.org/10.7916/D857195B
Mon, 07 Jul 2014 00:00:00 +0000
This research aims to develop an MCMC algorithm for estimating the Q-matrix in a Bayesian framework. A saturated multinomial model was used to estimate correlated attributes in the DINA model and rRUM. Closed-forms of posteriors for guess and slip parameters were derived for the DINA model. The random walk Metropolis-Hastings algorithm was applied to parameter estimation in the rRUM. An algorithm for reducing potential label switching was incorporated into the estimation procedure. A method for simulating data with correlated attributes for the DINA model and rRUM was offered. Three simulation studies were conducted to evaluate the algorithm for Bayesian estimation. Twenty simulated data sets for simulation study 1 were generated from independent attributes for the DINA model and rRUM. A hundred data sets from correlated attributes were generated for the DINA and rRUM with guess and slip parameters set to 0.2 in simulation study 2. Simulation study 3 analyzed data sets simulated from the DINA model with guess and slip parameters generated from Uniform (0.1, 0.4). Results from simulation studies showed that the Q-matrix recovery rate was satisfactory. Using the fraction-subtraction data, an empirical study was conducted for the DINA model and rRUM. The estimated Q-matrices from the two models were compared with the expert-designed Q-matrix.
Quantitative psychology and psychometrics, Statistics, Educational tests and measurements
Human Development, Measurement and Evaluation
Dissertations

Population Genetics of Identity By Descent
http://academiccommons.columbia.edu/catalog/ac:175990
Palamara, Pier Francesco
http://dx.doi.org/10.7916/D8V122XT
Mon, 07 Jul 2014 00:00:00 +0000
Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history.
Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.
Genetics, Computer science, Statistics
pp2314
Computer Science
Dissertations

Statistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
http://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephanie
http://dx.doi.org/10.7916/D8TQ5ZP5
Mon, 07 Jul 2014 00:00:00 +0000
There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis examines several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models. We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is a prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We first examine the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings. Depending on the area of application and the researcher's degree of control over the experiment design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models.
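As a concrete picture of the Q-matrix described above, the following minimal sketch encodes a small, made-up Q-matrix and evaluates it under a conjunctive rule of the DINA type, with guess and slip parameters. The matrix, the parameter values, and the function names are hypothetical; the thesis abstract does not specify which model or parameterization it uses, so the DINA response rule here is an assumption chosen only for illustration.

```python
# Hypothetical Q-matrix: rows are items, columns are latent attributes.
# Q[j][k] == 1 means item j requires attribute k.
Q = [
    [1, 0],  # item 0 requires attribute 0 only
    [0, 1],  # item 1 requires attribute 1 only
    [1, 1],  # item 2 requires both attributes
]


def ideal_response(q_row, alpha):
    """Return 1 if the respondent masters every attribute the item requires."""
    return int(all(a >= q for q, a in zip(q_row, alpha)))


def dina_prob_correct(q_row, alpha, guess, slip):
    """P(correct) under the DINA rule: 1 - slip if all required
    attributes are mastered, otherwise the guessing probability."""
    return 1.0 - slip if ideal_response(q_row, alpha) else guess


# A respondent who has mastered attribute 0 but not attribute 1.
alpha = (1, 0)
ideal = [ideal_response(row, alpha) for row in Q]  # [1, 0, 0]
probs = [dina_prob_correct(row, alpha, guess=0.2, slip=0.1) for row in Q]
```

In this toy case, respondents with attribute profiles (0, 0) and (0, 1) give different ideal responses only through item 1, which hints at why the structure of the Q-matrix matters for distinguishing latent classes.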
In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.
Statistics, Educational tests and measurements, Quantitative psychology and psychometrics
Statistics
Dissertations

Unbiased Penetrance Estimates with Unknown Ascertainment Strategies
http://academiccommons.columbia.edu/catalog/ac:175879
Gore, Kristen
http://dx.doi.org/10.7916/D8KP8098
Mon, 07 Jul 2014 00:00:00 +0000
Allelic variation in the genome leads to variation in individuals' production of proteins. This, in turn, leads to variation in traits and development, and, in some cases, to diseases. Understanding the genetic basis for disease can aid in the search for therapies and in guiding genetic counseling. Thus, it is of interest to discover the genes with mutations responsible for diseases and to understand the impact of allelic variation at those genes. A subject's genetic composition is commonly referred to as the subject's genotype. Subjects who carry the gene mutation of interest are referred to as carriers. Subjects who are afflicted with a disease under study (that is, subjects who exhibit the phenotype) are termed affected carriers. The age-specific probability that a given subject will exhibit a phenotype of interest, given mutation status at a gene, is known as penetrance. Understanding penetrance is an important facet of genetic epidemiology. Penetrance estimates are typically calculated via maximum likelihood from family data. However, penetrance estimates can be biased if the nature of the sampling strategy is not correctly reflected in the likelihood. Unfortunately, sampling of family data may be conducted in a haphazard fashion or, even if conducted systematically, might be reported in an incomplete fashion. Bias is possible in applying likelihood methods to reported data if (as is commonly the case) some unaffected family members are not represented in the reports. The purpose here is to present an approach to find efficient and unbiased penetrance estimates in cases where there is incomplete knowledge of the sampling strategy and incomplete information on the full pedigree structure of families included in the data.
The method may be applied with different conjectural assumptions about the ascertainment strategy to balance the possibly biasing effects of wishful assumptions about the sampling strategy with the efficiency gains that could be obtained through valid assumptions.
Statistics
Statistics
Dissertations

Learning Theory Analysis for Association Rules and Sequential Event Prediction
http://academiccommons.columbia.edu/catalog/ac:173905
Rudin, Cynthia; Letham, Benjamin; Madigan, David B.
http://dx.doi.org/10.7916/D82N50C1
Thu, 15 May 2014 00:00:00 +0000
We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer's online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support.
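To make the contrast between ordinary confidence and an adjusted confidence concrete, here is a minimal sketch. The abstract does not give the formula, so the form used below, count(a and b) / (count(a) + K) with a nonnegative constant K, is stated as an assumption; the function names and the toy baskets are hypothetical.

```python
def support_count(itemset, sequences):
    """Number of past sequences that contain every item in `itemset`."""
    return sum(1 for seq in sequences if itemset <= set(seq))


def adjusted_confidence(lhs, rhs, sequences, k=0.0):
    """Assumed form: count(lhs and rhs) / (count(lhs) + k).

    With k = 0 this is the ordinary confidence of the rule lhs -> rhs.
    A larger k shrinks the score of rules whose left-hand side is rare,
    acting as a soft alternative to a hard minimum support cutoff.
    """
    lhs = frozenset(lhs)
    both = lhs | frozenset(rhs)
    denominator = support_count(lhs, sequences) + k
    if denominator == 0:
        return 0.0
    return support_count(both, sequences) / denominator


# Toy training set of past shopping baskets (made up for illustration).
baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["d", "c"]]

rare_rule_plain = adjusted_confidence({"d"}, {"c"}, baskets)          # 1/1
rare_rule_adjusted = adjusted_confidence({"d"}, {"c"}, baskets, k=1)  # 1/2
```

The rule d -> c looks perfect under ordinary confidence although it rests on a single basket; with K = 1 its score is halved, whereas a rule backed by many baskets would change much less.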
The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.Statistics, Artificial intelligencedm2418StatisticsArticlesA Characterization of Markov Equivalence Classes for Acyclic Digraphs
http://academiccommons.columbia.edu/catalog/ac:173896
Andersson, Steen A.; Madigan, David B.; Perlman, Michael D.http://dx.doi.org/10.7916/D8FX77J3Thu, 15 May 2014 00:00:00 +0000Undirected graphs and acyclic digraphs (ADG's), as well as their mutual extension to chain graphs, are widely used to describe dependencies among variables in multivariate distributions. In particular, the likelihood functions of ADG models admit convenient recursive factorizations that often allow explicit maximum likelihood estimates and that are well suited to building Bayesian networks for expert systems. Whereas the undirected graph associated with a dependence model is uniquely determined, there may be many ADG's that determine the same dependence (i.e., Markov) model. Thus, the family of all ADG's with a given set of vertices is naturally partitioned into Markov-equivalence classes, each class being associated with a unique statistical model. Statistical procedures, such as model selection or model averaging, that fail to take into account these equivalence classes may incur substantial computational or other inefficiencies. Here it is shown that each Markov-equivalence class is uniquely determined by a single chain graph, the essential graph, that is itself simultaneously Markov equivalent to all ADG's in the equivalence class. Essential graphs are characterized, a polynomial-time algorithm for their construction is given, and their applications to model selection and other statistical questions are described.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticlesMedication-Wide Association Studies
http://academiccommons.columbia.edu/catalog/ac:173912
Ryan, P. B.; Stang, P. E.; Madigan, David B.; Schuemie, M. J.; Hripcsak, George M.http://dx.doi.org/10.7916/D8PG1PVXThu, 15 May 2014 00:00:00 +0000Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges in confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects.Pharmacology, Statistics, Bioinformaticsdm2418, gh13Statistics, Biomedical InformaticsArticlesAlgorithms for Sparse Linear Classifiers in the Massive Data Setting
http://academiccommons.columbia.edu/catalog/ac:173908
Balakrishnan, Suhrid; Bartlett, Peter; Madigan, David B.http://dx.doi.org/10.7916/D8Z0368XThu, 15 May 2014 00:00:00 +0000Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.Statistics, Artificial intelligencedm2418StatisticsArticlesA One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
http://academiccommons.columbia.edu/catalog/ac:173899
Balakrishnan, Suhrid; Madigan, David B.http://dx.doi.org/10.7916/D8B56GTPThu, 15 May 2014 00:00:00 +0000For Bayesian analysis of massive data, Markov chain Monte Carlo (MCMC) techniques often prove infeasible due to computational resource constraints. Standard MCMC methods generally require a complete scan of the dataset for each iteration. Ridgeway and Madigan (2002) and Chopin (2002b) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a small portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this paper, we present "1PFS," an efficient, one-pass algorithm. The algorithm employs a simple modification of the Ridgeway and Madigan (2002) particle filtering algorithm that replaces the MCMC based "rejuvenation" step with a more efficient "shrinkage" kernel smoothing based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway and Madigan (2002), namely a mixture model for Markov chains and Bayesian logistic regression. Our results indicate the proposed scheme delivers accurate parameter estimates while employing only a single pass through the data.Mathematics, Statisticsdm2418StatisticsArticlesAnalysis of Variance of Cross-Validation Estimators of the Generalization Error
http://academiccommons.columbia.edu/catalog/ac:173902
Markatou, Marianthi; Tian, Hong; Biswas, Shameek; Hripcsak, George M.http://dx.doi.org/10.7916/D86D5R2XThu, 15 May 2014 00:00:00 +0000This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables Y = Card(S_j ∩ S_j') and Y* = Card(S_j^c ∩ S_j'^c), where S_j, S_j' are two training sets, and S_j^c, S_j'^c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results to the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.Statistics, Artificial intelligencemm168, ht2031, spb2003, gh13Statistics, Biomedical Informatics, BiostatisticsArticlesBook Reviews: Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth.
http://academiccommons.columbia.edu/catalog/ac:173915
Madigan, David B.http://dx.doi.org/10.7916/D8DZ06D8Thu, 15 May 2014 00:00:00 +0000"Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth. MIT Press, Cambridge, MA, 2001. $50.00. xxxii+546 pp., hardcover. ISBN 0-262-08290-X. Is data mining the same as statistics? The distinguished authors of Principles of Data Mining struggle to make a distinction between the two subjects. In the end, what they have written is a fine applied statistics text." -- page 501Statisticsdm2418StatisticsReviewsBayesian Hierarchical Rule Modeling for Predicting Medical Conditions
http://academiccommons.columbia.edu/catalog/ac:173882
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D8V69GP1Wed, 14 May 2014 00:00:00 +0000We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future medical conditions given the patient’s current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “condition 1 and condition 2 → condition 3”) from a large set of candidate rules. Because this method “borrows strength” using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of conditions is available.Applied mathematics, Statistics, Medicinedm2418StatisticsArticlesCorrection: Separation and completeness properties for AMP chain graph Markov models
http://academiccommons.columbia.edu/catalog/ac:173887
Madigan, David B.; Levitz, Michael; Perlman, Michael D.http://dx.doi.org/10.7916/D8QF8R05Wed, 14 May 2014 00:00:00 +0000Correction of table 2 on page 1757 of 'Separation and completeness properties for AMP chain graph Markov models', Annals of Statistics, volume 29 (2001).Mathematics, Statisticsdm2418StatisticsArticlesLocation Estimation in Wireless Networks: A Bayesian Approach
http://academiccommons.columbia.edu/catalog/ac:173820
Madigan, David B.; Ju, Wen-Hua; Krishnan, P.; Krishnakumar, A. S.; Zorych, Ivanhttp://dx.doi.org/10.7916/D82V2D74Tue, 13 May 2014 00:00:00 +0000We present a Bayesian hierarchical model for indoor location estimation in wireless networks. We demonstrate that our model achieves accuracy that is similar to other published models and algorithms. By harnessing prior knowledge, our model drastically reduces the requirement for training data as compared with existing approaches.Mathematics, Statistics, Applied mathematicsdm2418StatisticsArticlesA Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization
http://academiccommons.columbia.edu/catalog/ac:173817
Eyheramendy, Susana; Madigan, David B.http://dx.doi.org/10.7916/D86M34ZFTue, 13 May 2014 00:00:00 +0000We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method in text classification and in simulated data and show that our method outperforms the logistic and probit models and also the elastic net, in general by a substantial margin.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsBook chaptersGenerating Productive Dialogue between Consulting Statisticians and their Clients in the Pharmaceutical and Medical Research Settings
http://academiccommons.columbia.edu/catalog/ac:173832
Emir, Birol; Amaratunga, Dhammika; Beltangady, Mohan; Cabrera, Javier; Freeman, Roy; Madigan, David B.; Nguyen, Ha H.; Whalen, Edward Patrickhttp://dx.doi.org/10.7916/D8PK0D8NTue, 13 May 2014 00:00:00 +0000Due to the ever-increasing complexity of scientific technologies and resulting data, consulting statisticians are becoming more involved in the design, conduct, and analysis of biomedical research. This requires extensive collaboration between the consulting statistician and nonstatisticians, such as researchers, clinicians, and corporate executives. Consequently, a successful consulting career is becoming ever more dependent on the statistician's ability to effectively communicate with nonstatisticians. This is especially true when more complex, nontraditional analytical methods are required. In this paper, we examine the collaboration between statisticians and nonstatisticians from three different professional perspectives. Integrating these perspectives, we discuss ways to help the consulting statistician generate productive dialogue with clients. Finally, we examine how universities can better prepare students for careers in statistical consulting by incorporating more communication-based elements into their curriculum and by offering students ample opportunities to collaborate with nonstatisticians. Overall, we designed this exercise to help the consulting statistician generate dialogue with clients that results in more productive collaborations and a more satisfying work experience.Statistics, Bioinformatics, Medicinebe2166, dm2418, hhn2108, ew2320StatisticsArticles[Least Angle Regression]: Discussion
http://academiccommons.columbia.edu/catalog/ac:173841
Madigan, David B.; Ridgeway, Greghttp://dx.doi.org/10.7916/D81V5C29Tue, 13 May 2014 00:00:00 +0000Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms [e.g., Osborne, Presnell and Turlach (2000)]. Efron, Hastie, Johnstone and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such this paper is an important contribution to statistical computing.Mathematics, Statisticsdm2418StatisticsArticlesA Note on Equivalence Classes of Directed Acyclic Independence Graphs
http://academiccommons.columbia.edu/catalog/ac:173826
Madigan, David B.http://dx.doi.org/10.7916/D8TB150CTue, 13 May 2014 00:00:00 +0000Directed acyclic independence graphs (DAIGs) play an important role in recent developments in probabilistic expert systems and influence diagrams (Chyu [1]). The purpose of this note is to show that DAIGs can usefully be grouped into equivalence classes where the members of a single class share identical Markov properties. These equivalence classes can be identified via a simple graphical criterion. This result is particularly relevant to model selection procedures for DAIGs (see, e.g., Cooper and Herskovits [2] and Madigan and Raftery [4]) because it reduces the problem of searching among possible orientations of a given graph to that of searching among the equivalence classes.Mathematics, Statisticsdm2418StatisticsArticlesBayesian Model Averaging: a Tutorial (with Comments by M. Clyde, David Draper and E. I. George, and a Rejoinder by the Authors)
http://academiccommons.columbia.edu/catalog/ac:173853
Hoeting, Jennifer A.; Madigan, David B.; Raftery, Adrian E.; Volinsky, Chris T.; Clyde, M.; Draper, David; George, E. I.http://dx.doi.org/10.7916/D84M92N7Tue, 13 May 2014 00:00:00 +0000Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.Statisticsdm2418StatisticsArticlesSeparation and Completeness Properties for Amp Chain Graph Markov Models
http://academiccommons.columbia.edu/catalog/ac:173847
Levitz, Michael; Perlman, Michael D.; Madigan, David B.http://dx.doi.org/10.7916/D8X34VJGTue, 13 May 2014 00:00:00 +0000Pearl’s well-known d-separation criterion for an acyclic directed graph (ADG) is a pathwise separation criterion that can be used to efficiently identify all valid conditional independence relations in the Markov model determined by the graph. This paper introduces p-separation, a pathwise separation criterion that efficiently identifies all valid conditional independences under the Andersson–Madigan–Perlman (AMP) alternative Markov property for chain graphs (= adicyclic graphs), which include both ADGs and undirected graphs as special cases. The equivalence of p-separation to the augmentation criterion occurring in the AMP global Markov property is established, and p-separation is applied to prove completeness of the global Markov property for AMP chain graph models. Strong completeness of the AMP Markov property is established, that is, the existence of Markov perfect distributions that satisfy those and only those conditional independences implied by the AMP property (equivalently, by p-separation). A linear-time algorithm for determining p-separation is presented.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticles[Bayesian Analysis in Expert Systems]: Comment: What's Next?
http://academiccommons.columbia.edu/catalog/ac:173856
Madigan, David B.http://dx.doi.org/10.7916/D8W37TFJTue, 13 May 2014 00:00:00 +0000"These papers represent two of the many different graphical modeling camps that have emerged from a flurry of activity in the past decade. The paper by Cox and Wermuth falls within the statistical graphical modeling camp and provides a useful generalization of that body of work. There is, of course, a price to be paid for this generality, namely that the interpretation of the graphs is more complex...The paper by Spiegelhalter, Dawid, Lauritzen and Cowell falls within the probabilistic expert system camp. This is a tour de force by researchers responsible for much of the astonishing progress in this area. Ten years ago, probabilistic models were shunned by the artificial intelligence community. That they are now widely accepted and used is due in large measure to the insights and efforts of these authors, along with other pioneers such as Judea Pearl and Peter Cheeseman..." -- page 261Mathematics, Statisticsdm2418StatisticsArticlesA Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction
http://academiccommons.columbia.edu/catalog/ac:173838
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D89C6VJDTue, 13 May 2014 00:00:00 +0000In many healthcare settings, patients visit healthcare professionals periodically and report multiple medical conditions, or symptoms, at each encounter. We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future symptoms given the patient’s current and past history of reported symptoms. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “symptom 1 and symptom 2 → symptom 3”) from a large set of candidate rules. Because this method “borrows strength” using the symptoms of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of symptoms is available.Mathematics, Statistics, Medicinedm2418StatisticsArticlesFit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to-one matched case-control data
http://academiccommons.columbia.edu/catalog/ac:174087
Zhou, Hui; Wang, Shuang; Zheng, Tianhttp://dx.doi.org/10.7916/D8028PNJMon, 12 May 2014 00:00:00 +0000Fit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to-one matched case-control dataStatisticshz2240, sw2206, tz33Statistics, BiostatisticsComputer softwareUnderstanding the Nature of Stellar Chemical Abundance Distributions in Nearby Stellar Systems
http://academiccommons.columbia.edu/catalog/ac:173510
Lee, Duane Morrishttp://dx.doi.org/10.7916/D84747X6Fri, 25 Apr 2014 00:00:00 +0000Since stars retain signatures of their galactic origins in their chemical compositions, we can exploit the chemical abundance distributions that we observe in stellar systems to put constraints on the nature of their progenitors. In this thesis, I present results from three projects aimed at understanding how high resolution spectroscopic observations of nearby stellar systems might be interpreted. The first project presents one possible explanation for the origin of peculiar abundance distributions observed in ultra-faint dwarf satellites of the Milky Way. The second project explores to what extent the distribution of chemical elements in the stellar halo can be used to trace Galactic accretion history from the birth of the Galaxy to the present day. Finally, a third project focuses on developing an input optimization algorithm for the second project to produce better estimates of halo accretion histories. In conclusion, I propose some other new ways to use statistical models and techniques along with chemical abundance distribution data to uncover galactic histories.Astronomy, Statistics, Nuclear chemistryAstronomyDissertationsA Point Process Model for the Dynamics of Limit Order Books
http://academiccommons.columbia.edu/catalog/ac:171221
Vinkovskaya, Ekaterinahttp://dx.doi.org/10.7916/D88913WWFri, 28 Feb 2014 00:00:00 +0000This thesis focuses on the statistical modeling of the dynamics of limit order books in electronic equity markets. The statistical properties of events affecting a limit order book -market orders, limit orders and cancellations- reveal strong evidence of clustering in time, cross-correlation across event types and dependence of the order flow on the bid-ask spread. Further investigation reveals the presence of a self-exciting property - that a large number of events in a given time period tends to imply a higher probability of observing a large number of events in the following time period. We show that these properties may be adequately represented by a multivariate self-exciting point process with multiple regimes that reflect changes in the bid-ask spread. We propose a tractable parametrization of the model and perform a Maximum Likelihood Estimation of the model using high-frequency data from the Trades and Quotes database for US stocks. We show that the model may be used to obtain predictions of order flow and that its predictive performance beats the Poisson model as well as Moving Average and Auto Regressive time series models.StatisticsStatisticsDissertationsMixed Methods for Mixed Models
http://academiccommons.columbia.edu/catalog/ac:169644
Dorie, Vincent J.http://dx.doi.org/10.7916/D8V40S5XWed, 22 Jan 2014 00:00:00 +0000This work bridges the frequentist and Bayesian approaches to mixed models by borrowing the best features from both camps: point estimation procedures are combined with priors to obtain accurate, fast inference while posterior simulation techniques are developed that approximate the likelihood with great precision for the purposes of assessing uncertainty. These allow flexible inferences without the need to rely on expensive Markov chain Monte Carlo simulation techniques. Default priors are developed and evaluated in a variety of simulation and real-world settings with the end result that we propose a new set of standard approaches that yield superior performance at little computational cost.StatisticsStatisticsDissertationsMathematical Representations of Development Theories
http://academiccommons.columbia.edu/catalog/ac:168029
Singer, Burton; Spilerman, Seymour; Nesselroade, John R.; Boltes, Paul B.http://dx.doi.org/10.7916/D8NP22DSFri, 06 Dec 2013 00:00:00 +0000In this chapter we explore the consequences of particular stage linkage structures for the evolution of a population. We first argue the importance of constructing dynamic models of development theories and show the implications of various stage connections for population movements. A second focus concerns inverse problems: How the stage linkage structure may be recovered from survey data of the kind collected by developmental psychologists.Developmental psychology, Statisticsss50SociologyBook chaptersSample Palomar Transient Factory light curves
http://academiccommons.columbia.edu/catalog/ac:167874
Price-Whelan, Adrian M.; Agueros, Marcel; Fournier, Amanda P.; Street, Rachel; Ofek, Eran O.; Covey, Kevin R.; Levitan, David; Laher, Russ R.; Sesar, Branimir; Surace, Jasonhttp://dx.doi.org/10.7916/D8CF9N1NMon, 25 Nov 2013 00:00:00 +0000These light curves are made available to the public as part of the publication of our recent paper, "Statistical Searches for Microlensing Events in Large, Non-Uniformly Sampled Time-Domain Surveys: A Test Using Palomar Transient Factory Data." We have selected ~10,000 light curves from the Palomar Transient Factory database that can be used to test the various statistical tools described in the paper.Astronomy, Statisticsmaa17AstronomyDatasetsLearning to Believe in Sunspots
http://academiccommons.columbia.edu/catalog/ac:167710
Woodford, Michaelhttp://dx.doi.org/10.7916/D85X26VBMon, 25 Nov 2013 00:00:00 +0000An adaptive learning rule is exhibited for the Azariadis (1981) overlapping generations model of a monetary economy with multiple equilibria, under which the economy may converge to a stationary sunspot equilibrium, even if agents do not initially believe that outcomes are significantly different in different "sunspot" states. The type of learning rule studied is of the "stochastic approximation" form studied by Robbins and Monro (1951); methods for analyzing the convergence of this form of algorithm are presented that may be of use in many other contexts as well. Conditions are given under which convergence to a sunspot equilibrium occurs with probability one.Economics, Economic theory, Statisticsmw2230EconomicsArticlesProspect Theory as Efficient Perceptual Distortion
http://academiccommons.columbia.edu/catalog/ac:167407
Woodford, Michaelhttp://dx.doi.org/10.7916/D8T43R03Thu, 21 Nov 2013 00:00:00 +0000The paper proposes a theory of efficient perceptual distortions, in which the statistical relation between subjective perceptions and the objective state minimizes the error of the state estimate, subject to a constraint on information processing capacity. The theory is shown to account for observed limits to the accuracy of visual perception, and then postulated to apply to perception of options in economic choice situations as well. When applied to choice between lotteries, it implies reference-dependent valuations, and predicts both risk-aversion with respect to gains and risk-seeking with respect to losses, as in the prospect theory of Kahneman and Tversky (1979).Statistics, Economic theory, Sociologymw2230EconomicsArticlesTwo Papers of Financial Engineering Relating to the Risk of the 2007--2008 Financial Crisis
http://academiccommons.columbia.edu/catalog/ac:167143
Zhong, Haowenhttp://dx.doi.org/10.7916/D8CC0XMGFri, 15 Nov 2013 00:00:00 +0000This dissertation studies two financial engineering and econometrics problems relating to two facets of the 2007-2008 financial crisis. In the first part, we construct the Spatial Capital Asset Pricing Model and the Spatial Arbitrage Pricing Theory to characterize the risk premiums of futures contracts on real estate assets. We also provide rigorous econometric analysis of the new models. Empirical study shows there exists significant spatial interaction among the S&P/Case-Shiller Home Price Index futures returns. In the second part, we perform empirical studies on the jump risk in the equity market. We propose a simple affine jump-diffusion model for equity returns, which seems to outperform existing ones (including models with Levy jumps) during the financial crisis and is at least as good during normal times, if model complexity is taken into account. In comparing the models, we made two empirical findings: (i) jump intensity seems to increase significantly during the financial crisis, while on average there appears to be little change of jump sizes; (ii) a finite number of large jumps in returns for any finite time horizon seems to fit the data well both before and after the crisis.Operations research, Statisticshz2193Industrial Engineering and Operations ResearchDissertationsKernel-based association measures
http://academiccommons.columbia.edu/catalog/ac:167034
Liu, Yinghttp://hdl.handle.net/10022/AC:P:22154Thu, 07 Nov 2013 00:00:00 +0000Measures of associations have been widely used for describing the statistical relationships between two sets of variables. Traditional association measures tend to focus on specialized settings (specific types of variables or association patterns). Based on an in-depth summary of existing measures, we propose a general framework for association measures unifying existing methods and novel extensions based on kernels, including practical solutions to computational challenges. The proposed framework provides improved feature selection and extensions to a variety of current classifiers. Specifically, we introduce association screening and variable selection via maximizing kernel-based association measures. We also develop a backward dropping procedure for feature selection when there are a large number of candidate variables. We evaluate our framework using a wide variety of both simulated and real data. In particular, we conduct independence tests and feature selection using kernel association measures on diversified association patterns of different dimensions and variable types. The results show the superiority of our methods to existing ones. We also apply our framework to four real-world problems, three from statistical genetics and one of gender prediction from handwriting. We demonstrate through these applications both the de novo construction of new kernels and the adaptation of existing kernels tailored to the data at hand, and how kernel-based measures of associations can be naturally applied to different data structures including functional input and output spaces. This shows that our framework can be applied to a wide range of real-world problems and works well in practice.Statistics, Computer scienceyl2802StatisticsDissertationsInference of functional neural connectivity and convergence acceleration methods
http://academiccommons.columbia.edu/catalog/ac:179409
Nikitchenko, Maxim V.http://hdl.handle.net/10022/AC:P:22052Thu, 31 Oct 2013 00:00:00 +0000The knowledge of the maps of neuronal interactions is key for system neuroscience, but at the moment we possess relatively little of it. The recent development of experimental methods which allow a simultaneous recording of the spiking activity, but not the intracellular voltage, of thousands of neurons gives us an opportunity to start filling that gap. In Chapter 2, I present a method for the inference of the parameters of the leaky integrate-and-fire (LIF) model featuring time-dependent currents and conductances based only on the extracellular recording of spiking in the network. The fitted parameters can describe the functional connections in the network, as well as the internal properties of the cells. The method can also be used to determine whether a single-compartment model of a neuron should include conductance- or current-based synapses, or their mixture. In addition, because the same mathematical model describes some of the flavors of the Drift Diffusion Model (DDM), popular in the studies of the decision-making process, the presented method can be readily used to fit their parameters. Making the proposed inference procedure -- based on the expectation-maximization (EM) algorithm -- accurate and robust necessitated the development of a new numerical adaptive-grid (AG) method for the forward-backward (FB) propagation of the probability density, which is required in the computation of the sufficient statistic in the EM algorithm. These topics are covered in Chapter 3. Another issue which had to be addressed in order to obtain a usable inference algorithm is the well-known slow convergence of the EM algorithm in the flat regions of the loglikelihood. Two complementary approaches to this issue are presented in this dissertation.
In Chapter 4, I present a new framework for the acceleration of convergence of iterative algorithms (not limited to the EM) which unifies all previously known methods and allows us to construct a new method demonstrating the best performance of them all. To make the computations even faster, I wrote a Matlab package which allows them to be done in parallel on several machines and clusters. As one can see, all the aforementioned projects sprouted from one "head" project on the inference of the LIF model parameters. At the end of the dissertation, I briefly describe a disconnected project which is devoted to the development of a flexible experimental setup (software and hardware) for behavioral experiments, with a specific application to a particular type of the virtual Morris water maze experiment (VMWM).Neurosciences, Statisticsmvn2104Statistics, Neurobiology and BehaviorDissertationsLow-rank graphical models and Bayesian inference in the statistical analysis of noisy neural data
http://academiccommons.columbia.edu/catalog/ac:166472
Smith, Carl Alexanderhttp://hdl.handle.net/10022/AC:P:21991Fri, 11 Oct 2013 00:00:00 +0000We develop new methods of Bayesian inference, largely in the context of analysis of neuroscience data. The work is broken into several parts. In the first part, we introduce a novel class of joint probability distributions in which exact inference is tractable. Previously it has been difficult to find general constructions for models in which efficient exact inference is possible, outside of certain classical cases. We identify a class of such models that are tractable owing to a certain "low-rank" structure in the potentials that couple neighboring variables. In the second part we develop methods to quantify and measure information loss in analysis of neuronal spike train data due to two types of noise, making use of the ideas developed in the first part. Information about neuronal identity or temporal resolution may be lost during spike detection and sorting, or precision of spike times may be corrupted by various effects. We quantify the information lost due to these effects for the relatively simple but sufficiently broad class of Markovian model neurons. We find that decoders that model the probability distribution of spike-neuron assignments significantly outperform decoders that use only the most likely spike assignments. We also apply the ideas of the low-rank models from the first section to defining a class of prior distributions over the space of stimuli (or other covariate) which, by conjugacy, preserve the tractability of inference. In the third part, we treat Bayesian methods for the estimation of sparse signals, with application to the locating of synapses in a dendritic tree. We develop a compartmentalized model of the dendritic tree. 
Building on previous work that applied and generalized ideas of least angle regression to obtain a fast Bayesian solution to the resulting estimation problem, we describe two other approaches to the same problem, one employing a horseshoe prior and the other using various spike-and-slab priors. In the last part, we revisit the low-rank models of the first section and apply them to the problem of inferring orientation selectivity maps from noisy observations of orientation preference. The relevant low-rank model exploits the self-conjugacy of the von Mises distribution on the circle. Because the orientation map model is loopy, we cannot do exact inference on the low-rank model by the forward backward algorithm, but block-wise Gibbs sampling by the forward backward algorithm speeds mixing. We explore another von Mises coupling potential Gibbs sampler that proves to effectively smooth noisily observed orientation maps.Statistics, Neurosciencescas2207Statistics, ChemistryDissertationsGeneralized Volatility-Stabilized Processes
http://academiccommons.columbia.edu/catalog/ac:165162
Pickova, Radkahttp://hdl.handle.net/10022/AC:P:21616Fri, 13 Sep 2013 00:00:00 +0000In this thesis, we consider systems of interacting diffusion processes which we call Generalized Volatility-Stabilized processes, as they extend the Volatility-Stabilized Market models introduced in Fernholz and Karatzas (2005). First, we show how to construct a weak solution of the underlying system of stochastic differential equations. In particular, we express the solution in terms of time-changed squared-Bessel processes and argue that this solution is unique in distribution. In addition, we also discuss sufficient conditions under which this solution does not explode in finite time, and provide sufficient conditions for pathwise uniqueness and for existence of a strong solution. Secondly, we discuss the significance of these processes in the context of Stochastic Portfolio Theory. We describe specific market models which assume that the dynamics of the stocks' capitalizations is the same as that of the Generalized Volatility-Stabilized processes, and we argue that strong relative arbitrage opportunities may exist in these markets; specifically, we provide multiple examples of portfolios that outperform the market portfolio. Moreover, we examine the properties of market weights as well as the diversity weighted portfolio in these models. Thirdly, we provide some asymptotic results for these processes which allow us to describe different properties of the corresponding market models based on these processes.Statisticsrp2424Statistics, MathematicsDissertationsThe Representation of Social Processes by Markov Models
http://academiccommons.columbia.edu/catalog/ac:165054
Singer, Burton; Spilerman, Seymourhttp://hdl.handle.net/10022/AC:P:21574Thu, 12 Sep 2013 00:00:00 +0000In this paper we consider a class of issues which are central to modeling social phenomena by continuous-time Markov structures. In particular, we discuss (a) embeddability, or how to determine whether observations on an empirical process could have arisen via the evolution of a continuous-time Markov structure; and (b) identification, or what to do if the observations are consistent with more than one continuous-time Markov structure. With respect to the latter topic, we discuss how to select the specific structure from the list of alternatives which should be associated with the empirical process. We point out that the issues of embeddability and identification are especially pertinent to modeling empirical processes when one has available only fragmentary data and when the observations contain "noise" or other sources of error. These characteristics, of course, describe the typical work situation of sociologists. Finally, we note the type of situation in which a continuous-time model is the proper structure to employ and indicate that issues analogous to the ones we describe here apply to modeling social processes with discrete-time structures.Sociology, Statisticsss50SociologyArticlesThe Cognitive and Demographic Variables that Underlie Notetaking and Review in Mathematics: Does Quality of Notes Predict Test Performance in Mathematics?
http://academiccommons.columbia.edu/catalog/ac:163324
Belanfante, Elizabeth Andreahttp://hdl.handle.net/10022/AC:P:21089Tue, 16 Jul 2013 00:00:00 +0000Taking and reviewing lecture notes is an effective and prevalent method of studying employed by students at the post-secondary level (Armbruster, 2000; Armbruster, 2009; Dunkel and Davy, 1989; Peverly et al., 2009). However, few studies have examined the cognitive variables that underlie this skill. In addition, these studies have focused on more verbally based domains, such as history and psychology. The current study examined the practical utility of notes in actual class settings. It is the first study that has attempted to examine the outcomes and cognitive skills associated with note-taking and review in any area of mathematics. It also set out to establish the importance of quality of notes and quality of review sheets to test performance in graduate level probability and statistics courses. Finally, this dissertation sought to explore the extent to which variables besides notes also contribute to test performance in this domain. Participants included 74 graduate students enrolled in introductory probability and statistics courses at a private graduate teacher education college in a large city in the Northeast United States. Participants took notes during class and provided the researcher with a copy of their notes for several lectures. Participants were also required to write down additional information on the back of two formula sheets that were used as an aid on the midterm exam. The independent variables included handwriting speed, gender, spatial visualization ability, background knowledge, verbal ability, and working memory. The dependent variables were quality of lecture notes, quality of supplemental review sheets, and midterm performance. All measures were group administered. Results revealed that gender was the only predictor of quality of lecture notes. Quality of lecture notes was the only significant predictor of quality of supplemental review sheets. 
Neither quality of lecture notes nor quality of supplemental review sheets predicted overall test performance. Instead, background knowledge and instructor significantly predicted overall test performance. Handwriting speed was a marginally significant predictor of overall test performance. Future research aimed at replicating these findings and expanding the results to include other mathematical domains and educational levels is recommended.Mathematics, Statistics, Educationeab2111Health and Behavior Studies, School PsychologyDissertationsApplication of ordered latent class regression model in educational assessment
http://academiccommons.columbia.edu/catalog/ac:161911
Cha, Jisunghttp://hdl.handle.net/10022/AC:P:20599Thu, 06 Jun 2013 00:00:00 +0000Latent class analysis is a useful tool to deal with discrete multivariate response data. Croon (1990) proposed the ordered latent class model where latent classes are ordered by imposing inequality constraints on the cumulative conditional response probabilities. Taking stochastic ordering of latent classes into account in the analysis of data gives a meaningful interpretation, since the primary purpose of a test is to order students on the latent trait continuum. This study extends Croon's model to ordered latent class regression that regresses latent class membership on covariates (e.g., gender, country) and demonstrates the utilities of an ordered latent class regression model in educational assessment using data from Trends in International Mathematics and Science Study (TIMSS). The benefit of this model is that item analysis and group comparisons can be done simultaneously in one model. The model is fitted by maximum likelihood estimation with an EM algorithm. It is found that the proposed model is a useful tool for exploratory purposes as a special case of nonparametric item response models and cross-country differences can be modeled as different compositions of discrete classes. Simulations are conducted to evaluate the performance of information criteria (AIC and BIC) in selecting the appropriate number of latent classes in the model. From the simulation results, AIC outperforms BIC for the model with the order-restricted maximum likelihood estimator.Educational tests and measurements, Statistics, Mathematics educationjc2320Human Development, Measurement and EvaluationDissertationsPenalized Joint Maximum Likelihood Estimation Applied to Two Parameter Logistic Item Response Models
http://academiccommons.columbia.edu/catalog/ac:161745
Paolino, Jon-Paul Noelhttp://hdl.handle.net/10022/AC:P:20531Fri, 31 May 2013 00:00:00 +0000Item response theory (IRT) models are a conventional tool for analyzing both small scale and large scale educational data sets, and they are also used for the development of high-stakes tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Exam (GRE). When estimating these models it is imperative that the data set includes many more examinees than items, which is a similar requirement in regression modeling where many more observations than variables are needed. If this requirement has not been met the analysis will yield meaningless results. Recently, penalized estimation methods have been developed to analyze data sets that may include more variables than observations. The main focus of this study was to apply LASSO and ridge regression penalization techniques to IRT models in order to better estimate model parameters. The results of our simulations showed that this new estimation procedure called penalized joint maximum likelihood estimation provided meaningful estimates when IRT data sets included more items than examinees when traditional Bayesian estimation and marginal maximum likelihood methods were not appropriate. However, when the IRT datasets contained more examinees than items Bayesian estimation clearly outperformed both penalized joint maximum likelihood estimation and marginal maximum likelihood.Statisticsjnp2111Human Development, Measurement and EvaluationDissertationsStochastic Models of Limit Order Markets
http://academiccommons.columbia.edu/catalog/ac:161685
Kukanov, Arseniyhttp://hdl.handle.net/10022/AC:P:20511Thu, 30 May 2013 00:00:00 +0000During the last two decades most stock and derivatives exchanges in the world transitioned to electronic trading in limit order books, creating a need for a new set of quantitative models to describe these order-driven markets. This dissertation offers a collection of models that provide insight into the structure of modern financial markets, and can help to optimize trading decisions in practical applications. In the first part of the thesis we study the dynamics of prices, order flows and liquidity in limit order markets over short timescales. We propose a stylized order book model that predicts a particularly simple linear relation between price changes and order flow imbalance, defined as a difference between net changes in supply and demand. The slope in this linear relation, called a price impact coefficient, is inversely proportional in our model to market depth - a measure of liquidity. Our empirical results confirm both of these predictions. The linear relation between order flow imbalance and price changes holds for time intervals between 50 milliseconds and 5 minutes. The inverse relation between the price impact coefficient and market depth holds on longer timescales. These findings shed a new light on intraday variations in market volatility. According to our model volatility fluctuates due to changes in market depth or in order flow variance. Previous studies also found a positive correlation between volatility and trading volume, but in order-driven markets prices are determined by the limit order book activity, so the association between trading volume and volatility is unclear. We show how a spurious correlation between these variables can indeed emerge in our linear model due to time aggregation of high-frequency data. 
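As a rough illustration of the linear relation described above, the following Python sketch simulates price changes that are proportional to order flow imbalance, with a slope inversely proportional to market depth, and then recovers that slope by ordinary least squares. The function names, the constant `c`, the noise level, and the simulated data are all assumptions made for illustration; none of them come from the dissertation.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(depth, n=5000, c=0.5, noise=0.01, seed=0):
    """Simulate (order flow imbalance, price change) pairs.

    Assumed toy model: price_change = (c / depth) * ofi + noise,
    so the price impact coefficient is inversely proportional to depth.
    """
    rng = random.Random(seed)
    beta = c / depth  # hypothetical price impact coefficient
    ofi = [rng.gauss(0.0, 100.0) for _ in range(n)]
    price_change = [beta * x + rng.gauss(0.0, noise) for x in ofi]
    return ofi, price_change


# Estimated price impact in a shallow market and in a market four times deeper.
shallow = ols_slope(*simulate(depth=200.0))
deep = ols_slope(*simulate(depth=800.0))
print(shallow, deep, shallow / deep)
```

In this toy setup the estimated slope for the shallow market should come out close to four times the slope for the deep one, mirroring the inverse relation between price impact and depth.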
Finally, we observe short-term positive autocorrelation in order flow imbalance and discuss an application of this variable as a measure of adverse selection in limit order executions. Our results suggest that monitoring recent order flow can improve the quality of order executions in practice. In the second part of the thesis we study the problem of optimal order placement in a fragmented limit order market. To execute a trade, market participants can submit limit orders or market orders across various exchanges where a stock is traded. In practice these decisions are influenced by sizes of order queues and by statistical properties of order flows in each limit order book, and also by rebates that exchanges pay for limit order submissions. We present a realistic model of limit order executions and formalize the search for an optimal order placement policy as a convex optimization problem. Based on this formulation we study how various factors determine investor's order placement decisions. In a case when a single exchange is used for order execution, we derive an explicit formula for the optimal limit and market order quantities. Our solution shows that the optimal split between market and limit orders largely depends on one's tolerance to execution risk. Market orders help to alleviate this risk because they execute with certainty. Correspondingly, we find that an optimal order allocation shifts to these more expensive orders when the execution risk is of primary concern, for example when the intended trade quantity is large or when it is costly to catch up on the quantity after limit order execution fails. We also characterize the optimal solution in the general case of simultaneous order placement on multiple exchanges, and show that it sets execution shortfall probabilities to specific threshold values computed with model parameters. 
Finally, we propose a non-parametric stochastic algorithm that computes an optimal solution by resampling historical data and does not require specifying order flow distributions. A numerical implementation of this algorithm is used to study the sensitivity of an optimal solution to changes in model parameters. Our numerical results show that order placement optimization can bring a substantial reduction in trading costs, especially for small orders and in cases when order flows are relatively uncorrelated across trading venues. The order placement optimization framework developed in this thesis can also be used to quantify the costs and benefits of financial market fragmentation from the point of view of an individual investor. For instance, we find that a positive correlation between order flows, which is empirically observed in a fragmented U.S. equity market, increases the costs of trading. As the correlation increases it may become more expensive to trade in a fragmented market than it is in a consolidated market. In the third part of the thesis we analyze the dynamics of limit order queues at the best bid or ask of an exchange. These queues consist of orders submitted by a variety of market participants, yet existing order book models commonly assume that all orders have similar dynamics. In practice, some orders are submitted by trade execution algorithms in an attempt to buy or sell a certain quantity of assets under time constraints, and these orders are canceled if their realized waiting time exceeds a patience threshold. In contrast, high-frequency traders submit and cancel orders depending on the order book state and their orders are not driven by patience. The interaction between these two order types within a single FIFO queue leads to bursts of order cancelations for small queues and anomalously long waiting times in large queues. 
We analyze a fluid model that describes the evolution of large order queues in liquid markets, taking into account the heterogeneity between order submission and cancelation strategies of different traders. Our results show that after a finite initial time interval, the queue reaches a specific structure where all orders from high-frequency traders stay in the queue until execution but most orders from execution algorithms exceed their patience thresholds and are canceled. This "order crowding" effect has been previously noted by participants in highly liquid stock and futures markets and was attributed to a large participation of high-frequency traders. In our model, their presence creates an additional workload, which increases queue waiting times for new orders. Our analysis of the fluid model leads to waiting time estimates that take into account the distribution of order types in a queue. These estimates are tested against a large dataset of realized limit order waiting times collected by a U.S. equity brokerage firm. The queue composition at a moment of order submission noticeably affects its waiting time and we find that assuming a single order type for all orders in the queue leads to unrealistic results. Estimates that assume instead a mix of heterogeneous orders in the queue are closer to empirical data. Our model for a limit order queue with heterogeneous order types also appears to be interesting from a methodological point of view. It introduces a new type of behavior in a queueing system where one class of jobs has state-dependent dynamics, while others are driven by patience. Although this model is motivated by the analysis of limit order books, it may find applications in studying other service systems with state-dependent abandonments.Operations research, Finance, Statisticsak2870Industrial Engineering and Operations ResearchDissertationsCredit Risk Modeling and Analysis Using Copula Method and Changepoint Approach to Survival Data
http://academiccommons.columbia.edu/catalog/ac:161682
Qian, Bohttp://hdl.handle.net/10022/AC:P:20510Thu, 30 May 2013 00:00:00 +0000This thesis consists of two parts. The first part uses Gaussian Copula and Student's t Copula as the main tools to model the credit risk in securitizations and re-securitizations. The second part proposes a statistical procedure to identify changepoints in Cox model of survival data. The recent 2007-2009 financial crisis has been regarded as the worst financial crisis since the Great Depression by leading economists. The securitization sector took a lot of blame for the crisis because of the connection of the securitized products created from mortgages to the collapse of the housing market. The first part of this thesis explores the relationship between securitized mortgage products and the 2007-2009 financial crisis using the Copula method as the main tool. We show in this part how loss distributions of securitizations and re-securitizations can be derived or calculated in a new model. Simulations are conducted to examine the effectiveness of the model. As an application, the model is also used to examine whether and where the ratings of securitized products could be flawed. On the other hand, the lag effect and saturation effect problems are common and important problems in survival data analysis. They belong to a general class of problems where the treatment effect takes occasional jumps instead of staying constant throughout time. Therefore, they are essentially the changepoint problems in statistics. The second part of this thesis focuses on extending Lai and Xing's recent work in changepoint modeling, which was developed under a time series and Bayesian setup, to the lag effect problems in survival data. A general changepoint approach for Cox model is developed. 
Simulations and real data analyses are conducted to illustrate the effectiveness of the procedure and how it should be implemented and interpreted.Statisticsbq2102StatisticsDissertationsOn the relationship between total ozone and atmospheric dynamics and chemistry at mid-latitudes – Part 1: Statistical models and spatial fingerprints of atmospheric dynamics and chemistry
http://academiccommons.columbia.edu/catalog/ac:161210
Frossard, L.; Rieder, Harald; Ribatet, M.; Staehelin, J.; Maeder, J. A.; Di Rocco, S.; Davison, A. C.; Peter, T.http://hdl.handle.net/10022/AC:P:20344Thu, 16 May 2013 00:00:00 +0000We use statistical models for mean and extreme values of total column ozone to analyze "fingerprints" of atmospheric dynamics and chemistry on long-term ozone changes at northern and southern mid-latitudes on a grid cell basis. At each grid cell, the r-largest order statistics method is used for the analysis of extreme events in low and high total ozone (termed ELOs and EHOs, respectively), and an autoregressive moving average (ARMA) model is used for the corresponding mean value analysis. In order to describe the dynamical and chemical state of the atmosphere, the statistical models include important atmospheric covariates: the solar cycle, the Quasi-Biennial Oscillation (QBO), ozone depleting substances (ODS) in terms of equivalent effective stratospheric chlorine (EESC), the North Atlantic Oscillation (NAO), the Antarctic Oscillation (AAO), the El Niño/Southern Oscillation (ENSO), and aerosol load after the volcanic eruptions of El Chichón and Mt. Pinatubo. The influence of the individual covariates on mean and extreme levels in total column ozone is derived on a grid cell basis. The results show that "fingerprints", i.e., significant influence, of dynamical and chemical features are captured in both the "bulk" and the tails of the statistical distribution of ozone, respectively described by mean values and EHOs/ELOs. While results for the solar cycle, QBO, and EESC are in good agreement with findings of earlier studies, unprecedented spatial fingerprints are retrieved for the dynamical covariates. Column ozone is enhanced over Labrador/Greenland, the North Atlantic sector and over the Norwegian Sea, but is reduced over Europe, Russia and the Eastern United States during the positive NAO phase, and vice-versa during the negative phase. 
The NAO's southern counterpart, the AAO, strongly influences column ozone at lower southern mid-latitudes, including the southern parts of South America and the Antarctic Peninsula, and the central southern mid-latitudes. Results for both NAO and AAO confirm the importance of atmospheric dynamics for ozone variability and changes from local/regional to global scales.Statistics, Atmospheric chemistry, Atmospheric scienceshr2302Lamont-Doherty Earth ObservatoryArticlesBayesian Multidimensional Scaling Model for Ordinal Preference Data
http://academiccommons.columbia.edu/catalog/ac:161114
Matlosz, Kerry McCloskeyhttp://hdl.handle.net/10022/AC:P:20304Tue, 14 May 2013 00:00:00 +0000The model within the present study incorporated Bayesian Multidimensional Scaling and Markov Chain Monte Carlo methods to represent individual preferences and threshold parameters as they relate to the influence of survey items' popularity and their interrelationships. The model was used to interpret two independent data samples of ordinal consumer preference data related to purchasing behavior. The objective of the procedure was to provide an understanding and visual depiction of consumers' likelihood of having a strong affinity toward one of the survey choices, and how other survey choices relate to it. The study also aimed to derive the joint spatial representation of the subjects and products represented by the dissimilarity preference data matrix within a reduced dimensionality. This depiction would aim to enable interpretation of the preference structure underlying the data and potential demand for each product. Model simulations were created both from sampling the normal distribution, as well as incorporating Lambda values from the two data sets and were analyzed separately. Posterior checks were used to determine dimensionality, which were also confirmed within the simulation procedures. The statistical properties generated from the simulated data confirmed that the true parameter values (loadings, utilities, and latitudes) were recovered. The model effectiveness was contrasted and evaluated both within real data samples and a simulated data set. The two data sets analyzed were confirmed to have differences in their underlying preference structures that resulted in differences in the optimal dimensionality in which the data should be represented. 
The biases and MSEs of the lambdas and alphas provide further understanding of the data composition, and analysis of variance (ANOVA) confirmed that the differences in MSEs related to changes in dimensions were statistically significant.Statisticskmm2159Human Development, Measurement and EvaluationDissertationsNonlinear penalized estimation of true Q-matrix in cognitive diagnostic models
http://academiccommons.columbia.edu/catalog/ac:160812
Xiang, Ruihttp://hdl.handle.net/10022/AC:P:20149Wed, 01 May 2013 00:00:00 +0000A key issue in cognitive diagnostic models (CDMs) is the correct identification of the Q-matrix, which indicates the relationship between attributes and test items. Previous CDMs typically assumed a known Q-matrix provided by domain experts such as those who developed the questions. However, misspecifications of the Q-matrix have been discovered in past studies. The primary purpose of this research is to set up a mathematical framework to estimate the true Q-matrix based on item response data. The model considers all Q-matrix elements as parameters and estimates them through the EM algorithm. Two simulation designs are conducted to evaluate the feasibility and performance of the model. An empirical study is conducted to compare the estimated Q-matrix with the one designed by experts. The results show that the model performs well and is able to identify 60% to 90% of the correct elements of the Q-matrix. The model also indicates possible misspecifications of the designed Q-matrix in the fraction subtraction test.Statistics, Education, Psychologyrx2107Human Development, Measurement and EvaluationDissertationsStatistical Inference for Diagnostic Classification Models
http://academiccommons.columbia.edu/catalog/ac:160464
Xu, Gongjunhttp://hdl.handle.net/10022/AC:P:20058Tue, 30 Apr 2013 00:00:00 +0000Diagnostic classification models (DCM) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he/she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers to the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be further extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, this problem can be formulated as a pure significance testing problem with null hypothesis H0 : Q = Q0, where Q0 is the candidate Q-matrix. 
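To make the role of the Q-matrix concrete, the sketch below shows how a Q-matrix row enters the standard textbook formulation of the DINA model: a subject answers an item correctly with probability 1 - slip if they have mastered every attribute the item requires, and with probability guess otherwise. The slip and guess values, the small Q-matrix, and the function name are illustrative assumptions; they are not taken from the thesis.

```python
def dina_prob(alpha, q_row, slip, guess):
    """Probability of a correct response under the standard DINA model.

    alpha : list of 0/1 flags, the subject's attribute mastery profile
    q_row : list of 0/1 flags, the Q-matrix row for the item
    slip  : probability of an incorrect answer despite full mastery
    guess : probability of a correct answer without full mastery
    """
    # The ideal response is 1 only if every required attribute is mastered.
    has_all_required = all(a >= q for a, q in zip(alpha, q_row))
    return 1.0 - slip if has_all_required else guess


# Hypothetical Q-matrix: 3 items (rows) by 2 attributes (columns).
Q = [
    [1, 0],  # item 1 requires attribute 1 only
    [0, 1],  # item 2 requires attribute 2 only
    [1, 1],  # item 3 requires both attributes
]

profile = [1, 0]  # subject has mastered attribute 1 but not attribute 2
probs = [dina_prob(profile, row, slip=0.1, guess=0.2) for row in Q]
print(probs)
```

Changing a single entry of `Q` changes which profiles count as having mastered an item, which is why a misspecified Q-matrix can distort the interpretation of test results.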
We propose a test statistic that measures the consistency of observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied. In addition, we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.Statistics, Educational tests and measurementsgx2108StatisticsDissertationsOptimization Algorithms for Structured Machine Learning and Image Processing Problems
http://academiccommons.columbia.edu/catalog/ac:158764
Qin, Zhiweihttp://hdl.handle.net/10022/AC:P:19648Fri, 05 Apr 2013 00:00:00 +0000Optimization algorithms are often the solution engine for machine learning and image processing techniques, but they can also become the bottleneck in applying these techniques if they are unable to cope with the size of the data. With the rapid advancement of modern technology, data of unprecedented size has become more and more available, and there is an increasing demand to process and interpret the data. Traditional optimization methods, such as the interior-point method, can solve a wide array of problems arising from the machine learning domain, but it is also this generality that often prevents them from dealing with large data efficiently. Hence, specialized algorithms that can readily take advantage of the problem structure are highly desirable and of immediate practical interest. This thesis focuses on developing efficient optimization algorithms for machine learning and image processing problems of diverse types, including supervised learning (e.g., the group lasso), unsupervised learning (e.g., robust tensor decompositions), and total-variation image denoising. These algorithms are of wide interest to the optimization, machine learning, and image processing communities. Specifically, (i) we present two algorithms to solve the Group Lasso problem. First, we propose a general version of the Block Coordinate Descent (BCD) algorithm for the Group Lasso that employs an efficient approach for optimizing each subproblem exactly. We show that it exhibits excellent performance when the groups are of moderate size. For groups of large size, we propose an extension of the proximal gradient algorithm based on variable step-lengths that can be viewed as a simplified version of BCD. By combining the two approaches we obtain an implementation that is very competitive and often outperforms other state-of-the-art approaches for this problem. 
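For readers unfamiliar with proximal methods for the Group Lasso, the following sketch shows the standard group soft-thresholding operator, which is the proximal step for a sum of Euclidean norms over non-overlapping groups. It is a generic textbook building block and not the thesis's own algorithm; the function name, the example vector, and the threshold value are assumptions chosen for illustration.

```python
import math


def group_soft_threshold(v, groups, threshold):
    """Apply block soft-thresholding to each non-overlapping group of v.

    v         : list of floats
    groups    : list of index lists, one per group
    threshold : step length times the regularization weight
    """
    out = list(v)
    for idx in groups:
        norm = math.sqrt(sum(v[i] ** 2 for i in idx))
        # Shrink the whole group toward zero; zero it out if its norm is small.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * v[i]
    return out


v = [3.0, 4.0, 0.1, 0.1]
groups = [[0, 1], [2, 3]]
result = group_soft_threshold(v, groups, threshold=1.0)
print(result)
```

The first group has norm 5 and is shrunk by a factor of 0.8, while the second group has norm below the threshold and is set to zero as a block, which is how the penalty produces group-level sparsity.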
We show how these methods fit into the globally convergent general block coordinate gradient descent framework in (Tseng and Yun, 2009). We also show that the proposed approach is more efficient in practice than the one implemented in (Tseng and Yun, 2009). In addition, we apply our algorithms to the Multiple Measurement Vector (MMV) recovery problem, which can be viewed as a special case of the Group Lasso problem, and compare their performance to other methods in this particular instance; (ii) we further investigate sparse linear models with two commonly adopted general sparsity-inducing regularization terms, the overlapping Group Lasso penalty $\ell_1/\ell_2$-norm and the $\ell_1/\ell_\infty$-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require $O(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms; (iii) we study the problem of robust low-rank tensor recovery in a convex optimization framework, drawing upon recent advances in robust Principal Component Analysis and tensor completion. We propose tailored optimization algorithms with global convergence guarantees for solving both the constrained and the Lagrangian formulations of the problem. These algorithms are based on the highly efficient alternating direction augmented Lagrangian and accelerated proximal gradient methods. We also propose a nonconvex model that can often improve the recovery results from the convex models. 
We investigate the empirical recoverability properties of the convex and nonconvex formulations and compare the computational performance of the algorithms on simulated data. We demonstrate through a number of real applications the practical effectiveness of this convex optimization framework for robust low-rank tensor recovery; (iv) we consider the image denoising problem using total variation regularization. This problem is computationally challenging to solve due to the non-differentiability and non-linearity of the regularization term. We propose a new alternating direction augmented Lagrangian method, involving subproblems that can be solved efficiently and exactly. The global convergence of the new algorithm is established for the anisotropic total variation model. We compare our method with the split Bregman method and demonstrate the superiority of our method in computational performance on a set of standard test images.Operations research, Computer science, Statisticszq2107Industrial Engineering and Operations ResearchDissertationsAnalyzing Postdisaster Surveillance Data: The Effect of the Statistical Method
http://academiccommons.columbia.edu/catalog/ac:157474
DiMaggio, Charles J.; Galea, Sandro; Abramson, David M.http://hdl.handle.net/10022/AC:P:19291Thu, 07 Mar 2013 00:00:00 +0000Data from existing administrative databases and ongoing surveys or surveillance methods may prove indispensable after mass traumas as a way of providing information that may be useful to emergency planners and practitioners. The analytic approach, however, may affect exposure prevalence estimates and measures of association. We compare Bayesian hierarchical modeling methods to standard survey analytic techniques for survey data collected in the aftermath of a terrorist attack. Estimates for the prevalence of exposure to the terrorist attacks of September 11, 2001, varied by the method chosen. Bayesian hierarchical modeling returned the lowest estimate for exposure prevalence with a credible interval spanning nearly 3 times the range of the confidence intervals (CIs) associated with both unadjusted and survey procedures. Bayesian hierarchical modeling also returned a smaller point estimate for measures of association, although in this instance the credible interval was tighter than that obtained through survey procedures. Bayesian approaches allow a consideration of preexisting assumptions about survey data, and may offer potential advantages, particularly in the uncertain environment of postterrorism and disaster settings. Additional comparative analyses of existing data are necessary to guide our ability to use these techniques in future incidents.Public health, Statisticscjd11, sg822, dma3Epidemiology, National Center for Disaster Preparedness, Sociomedical Sciences, AnesthesiologyArticlesBayesian Model Selection in terms of Kullback-Leibler discrepancy
http://academiccommons.columbia.edu/catalog/ac:158374
Zhou, Shouhaohttp://hdl.handle.net/10022/AC:P:19157Mon, 25 Feb 2013 00:00:00 +0000In this article we investigate and develop practical model assessment and selection methods for Bayesian models, anticipating that a promising approach should be objective enough to accept, easy enough to understand, general enough to apply, simple enough to compute and coherent enough to interpret. We mainly restrict attention to the Kullback-Leibler divergence, a widely applied model evaluation measure that quantifies the similarity between the proposed candidate model and the underlying true model, where the true model refers only to a probability distribution taken as the best projection onto the statistical modeling space once we try to understand the real but unknown dynamics/mechanism of interest. In addition to reviewing and discussing the advantages and disadvantages of the historically and currently prevailing practical model selection methods in the literature, we propose a series of convenient and useful tools, each designed and applied for a different purpose, to assess in an asymptotically unbiased way how the candidate Bayesian models are favored in terms of predicting a future independent observation. We also explore the connection of the Kullback-Leibler based information criterion to Bayes factors, another of the most popular Bayesian model comparison approaches, after seeing the motivation through the developments of the Bayes factor variants. In general, we expect to provide useful guidance for researchers who are interested in conducting Bayesian data analysis.Statisticssz2020StatisticsDissertationsUse of External Representations in Reasoning about Causality
http://academiccommons.columbia.edu/catalog/ac:155648
Mason, Davidhttp://hdl.handle.net/10022/AC:P:18786Wed, 23 Jan 2013 00:00:00 +0000This research investigated whether diagrams aid in deductive reasoning with formal causal models. Four studies were conducted exploring participants' ability to discover causal paths, identify causes and effects, and create alternative explanations for variable relationships. In Study 1, abstract variables of the causal model were compared to contextually grounded variables and causal models presented as text or diagrams were compared. Participants given abstract diagrams did better in most tasks than participants in the other conditions, who all did similarly. Studies 2 and 3 compared causal models expressed in text to diagrammed causal models, and compared models using arrows to models using words when connecting variables. Participants who had arrowheads replaced with words made more errors than participants in other diagram conditions. Diagrammed causal models led to better performance than did other conditions, and there was no difference between different text models. Studies 4 and 5 tested the hypothesis that predictive reasoning (from cause to effect) is easier than diagnostic reasoning (from effect to cause). The two studies did not find any such effect.Cognitive psychology, Statisticsdlm2153Human Development, PsychologyDissertationsTropical Cyclone Risk Assessment Using Statistical Models
http://academiccommons.columbia.edu/catalog/ac:168327
Yonekura, EmmiFri, 14 Dec 2012 00:00:00 +0000Tropical cyclones (TC) in the western North Pacific (WNP) pose a serious threat to the coastal regions of Eastern Asia when they make landfall. The limited amount of observational data and the high computational cost of running TC-permitting dynamical models indicate a need for statistical models that can simulate large ensembles of TCs in order to cover the full range of possible activity that results from a given climate change. I construct and apply a statistical track model from the 1945-2007 observed "best tracks" in the IBTrACS database for the WNP. The lifecycle components--genesis, track propagation, and death--of each simulated track are determined stochastically based on the statistics of historical occurrences. The length scale that dictates what historical data to consider as "local" for each lifecycle component is calculated objectively through optimization. Overall, I demonstrate how a statistical model can be used as a tool to translate climate-induced changes in TC activity into implications for risk. In contrast to other studies, I show that the El Niño/Southern Oscillation (ENSO) has an effect on track propagation separate from the genesis effect. The ENSO effect on genesis results in higher landfall rates during La Niña years due to the shift in genesis location to the northeast. The effect on tracks is more geographically and seasonally varied due to local changes in the mid-level winds. I use local regression techniques to capture the relationship between ENSO, cyclogenesis, and track propagation. Stationary climate simulations are run for extreme ENSO states in order to better understand changes in TC activity and their implication for regional landfall. Additionally, several diagnostics are performed on model realizations of the historical period, confirming the model's ability to capture the geographical distribution and interannual variability of observed TCs. 
Lastly, as a step to connect synthetic TC track simulations to economic damage risk assessment, I use a Damage Index and total damage data for U.S. landfalling hurricanes and fit generalized Pareto distributions (GPD) to them. The Damage Index uniquely separates out the effects of the physical damage capacity of a TC and the local economic vulnerability of a coastal region. GPD fits are also performed using covariates in the scale parameter, where bathymetric slope and landfall intensity are found to be useful covariates for the Damage Index. Using the Damage Index with covariates model, two examples are shown of assessing damage risk for different climates. The first takes landfall data input from a statistical-deterministic TC model downscaled from GFDL and ECHAM model current and future climates. The second takes landfall data from a fully statistical track model with different values of relative sea surface temperature given as a statistical predictor.Atmospheric sciences, Statistics, Climate changeey2111Applied Physics and Applied Mathematics, Goddard Institute for Space Studies, Earth and Environmental SciencesDissertationsMultiplicative Multiresolution Analysis for Lie-group Valued Data Indexed by a Euclidean Parameter
http://academiccommons.columbia.edu/catalog/ac:155756
Stodden, Victoria C. http://hdl.handle.net/10022/AC:P:15397Wed, 12 Dec 2012 00:00:00 +0000Lie-valued Euclidean indexed data. These data might be: phase angles as functions of time or space, for example compass directions; 3D orientations of a rigid frame of reference as a function of time or space; or quaternions as a function of time or space. This can also be extended to quotients of Lie groups, which gives us the ability to model points on S2, the unit sphere, as functions of time or space.Computer science, Statisticsvcs2115StatisticsPresentationsTropical Cyclone Risk Assessment Using Statistical Models
http://academiccommons.columbia.edu/catalog/ac:167904
Yonekura, EmmiMon, 26 Nov 2012 00:00:00 +0000Tropical cyclones (TC) in the western North Pacific (WNP) pose a serious threat to the coastal regions of Eastern Asia when they make landfall. The limited amount of observational data and the high computational cost of running TC-permitting dynamical models indicate a need for statistical models that can simulate large ensembles of TCs in order to cover the full range of possible activity that results from a given climate change. I construct and apply a statistical track model from the 1945-2007 observed "best tracks" in the IBTrACS database for the WNP. The lifecycle components--genesis, track propagation, and death--of each simulated track are determined stochastically based on the statistics of historical occurrences. The length scale that dictates what historical data to consider as "local" for each lifecycle component is calculated objectively through optimization. Overall, I demonstrate how a statistical model can be used as a tool to translate climate-induced changes in TC activity into implications for risk. In contrast to other studies, I show that the El Niño/Southern Oscillation (ENSO) has an effect on track propagation separate from the genesis effect. The ENSO effect on genesis results in higher landfall rates during La Niña years due to the shift in genesis location to the northeast. The effect on tracks is more geographically and seasonally varied due to local changes in the mid-level winds. I use local regression techniques to capture the relationship between ENSO, cyclogenesis, and track propagation. Stationary climate simulations are run for extreme ENSO states in order to better understand changes in TC activity and their implication for regional landfall. Additionally, several diagnostics are performed on model realizations of the historical period, confirming the model's ability to capture the geographical distribution and interannual variability of observed TCs. 
Lastly, as a step to connect synthetic TC track simulations to economic damage risk assessment, I use a Damage Index and total damage data for U.S. landfalling hurricanes and fit generalized Pareto distributions (GPD) to them. The Damage Index uniquely separates out the effects of the physical damage capacity of a TC and the local economic vulnerability of a coastal region. GPD fits are also performed using covariates in the scale parameter, where bathymetric slope and landfall intensity are found to be useful covariates for the Damage Index. Using the Damage Index with covariates model, two examples are shown of assessing damage risk for different climates. The first takes landfall data input from a statistical-deterministic TC model downscaled from GFDL and ECHAM model current and future climates. The second takes landfall data from a fully statistical track model with different values of relative sea surface temperature given as a statistical predictor.Atmospheric sciences, Statistics, Climate changeey2111Applied Physics and Applied Mathematics, Earth and Environmental SciencesDissertationsSegregation in Social Networks Based on Acquaintanceship and Trust
http://academiccommons.columbia.edu/catalog/ac:154740
DiPrete, Thomas A.; Gelman, Andrew E.; McCormick, Tyler; Teitler, Julien O.; Zheng, Tianhttp://hdl.handle.net/10022/AC:P:15339Tue, 20 Nov 2012 00:00:00 +0000Using 2006 General Social Survey data, the authors compare levels of segregation by race and along other dimensions of potential social cleavage in the contemporary United States. Americans are not as isolated as the most extreme recent estimates suggest. However, hopes that “bridging” social capital is more common in broader acquaintanceship networks than in core networks are not supported. Instead, the entire acquaintanceship network is perceived by Americans to be about as segregated as the much smaller network of close ties. People do not always know the religiosity, political ideology, family behaviors, or socioeconomic status of their acquaintances, but perceived social divisions on these dimensions are high, sometimes rivaling racial segregation in acquaintanceship networks. The major challenge to social integration today comes from the tendency of many Americans to isolate themselves from others who differ on race, political ideology, level of religiosity, and other salient aspects of social identity.Statisticstad61, ag389 , thm2105, jot8, tz33Political Science, Sociology, Statistics, Social WorkArticlesBayesian Statistical Pragmatism
http://academiccommons.columbia.edu/catalog/ac:154737
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:15340Tue, 20 Nov 2012 00:00:00 +0000I agree with Rob Kass’ point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.Statisticsag389 Political Science, StatisticsArticlesMultiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box
http://academiccommons.columbia.edu/catalog/ac:154731
Su, Yu-Sung; Yajima, Masanao; Gelman, Andrew E.; Hill, Jenniferhttp://hdl.handle.net/10022/AC:P:15342Tue, 20 Nov 2012 00:00:00 +0000Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.Statisticsag389 Political Science, StatisticsArticlesR2WinBUGS: A Package for Running WinBUGS from R
http://academiccommons.columbia.edu/catalog/ac:154734
Sturtz, Sibylle; Ligges, Uwe; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:15341Tue, 20 Nov 2012 00:00:00 +0000The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has finished, it is possible either to read the resulting data into R by the package itself—which gives a compact graphical summary of inference and convergence diagnostics—or to use the facilities of the coda package for further analyses of the output. Examples are given to demonstrate the usage of this package.Statisticsag389 Political Science, StatisticsArticlesContributions to Semiparametric Inference to Biased-Sampled and Financial Data
http://academiccommons.columbia.edu/catalog/ac:177018
Sit, Tonyhttp://hdl.handle.net/10022/AC:P:14685Wed, 12 Sep 2012 00:00:00 +0000This thesis develops statistical models and methods for the analysis of life-time and financial data under the umbrella of a semiparametric framework. The first part studies the use of empirical likelihood on Lévy processes that are used to model the dynamics exhibited in financial data. The second part is a study of inferential procedures for survival data collected under various biased sampling schemes in transformation and accelerated failure time models. During the last decade Lévy processes with jumps have received increasing popularity for modelling market behaviour for both derivative pricing and risk management purposes. Chan et al. (2009) introduced the use of empirical likelihood methods to estimate the parameters of various diffusion processes via their characteristic functions which are readily available in most cases. Return series from the market are used for estimation. In addition to the return series, there are many derivatives actively traded in the market whose prices also contain information about parameters of the underlying process. This observation motivates us to combine the return series and the associated derivative prices observed at the market so as to provide a more reflective estimation with respect to the market movement and achieve a gain in efficiency. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. We performed simulation and case studies to demonstrate the feasibility and effectiveness of the proposed method. The second part of this thesis investigates a unified estimation method for semiparametric linear transformation models and the accelerated failure time model under general biased sampling schemes. The methodology proposed was first investigated in Paik (2009), in which the length-biased case is considered for transformation models. 
The new estimator is obtained from a set of counting process-based unbiased estimating equations, developed through introducing a general weighting scheme that offsets the sampling bias. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. A closed-form formula is derived for the limiting variance and the plug-in estimator is shown to be consistent. We demonstrate the unified approach through the special cases of left truncation, length-bias, the case-cohort design and variants thereof. Simulation studies and applications to real data sets are also presented.Statisticsts2500StatisticsDissertationsDetecting Dependence Change Points in Multivariate Time Series with Applications in Neuroscience and Finance
http://academiccommons.columbia.edu/catalog/ac:177012
Cribben, Ivor Johnhttp://hdl.handle.net/10022/AC:P:14681Wed, 12 Sep 2012 00:00:00 +0000In many applications there are dynamic changes in the dependency structure between multivariate time series. Two examples include neuroscience and finance. The second and third chapters focus on neuroscience and introduce a data-driven technique for partitioning a time course into distinct temporal intervals with different multivariate functional connectivity patterns between a set of brain regions of interest (ROIs). The technique, called Dynamic Connectivity Regression (DCR), detects temporal change points in functional connectivity and estimates a graph, or set of relationships between ROIs, for data in the temporal partition that falls between pairs of change points. Hence, DCR allows for estimation of both the time of change in connectivity and the connectivity graph for each partition, without requiring prior knowledge of the nature of the experimental design. Permutation and bootstrapping methods are used to perform inference on the change points. In the second chapter of this work, we focus on multi-subject data while in the third chapter, we concentrate on single-subject data and extend the DCR methodology in two ways: (i) we alter the algorithm to make it more accurate for individual subject data with a small number of observations and (ii) we perform inference on the edges or connections between brain regions in order to reduce the number of false positives in the graphs. We also discuss a Likelihood Ratio test to compare precision matrices (inverse covariance matrices) across subjects as well as a test across subjects on the single edges or partial correlations in the graph. In the final chapter of this work, we turn to a finance setting. We use the same DCR technique to detect changes in dependency structure in multivariate financial time series for situations where both the placement and number of change points is unknown. 
In this setting, DCR finds the dependence change points and estimates an undirected graph representing the relationship between time series within each interval created by pairs of adjacent change points. A shortcoming of the proposed DCR methodology is the presence of an excessive number of false positive edges in the undirected graphs, especially when the data deviates from normality. Here we address this shortcoming by proposing a procedure for performing inference on the edges, or partial dependencies between time series, that effectively removes false positive edges. We also discuss two robust estimation procedures based on ranks and the tlasso (Finegold and Drton, 2011) technique, which we contrast with the glasso technique used by DCR.Statisticsijc2104StatisticsDissertationsModeling Strategies for Large Dimensional Vector Autoregressions
http://academiccommons.columbia.edu/catalog/ac:152472
Zang, Pengfeihttp://hdl.handle.net/10022/AC:P:14666Tue, 11 Sep 2012 00:00:00 +0000The vector autoregressive (VAR) model has been widely used for describing the dynamic behavior of multivariate time series. However, fitting standard VAR models to large dimensional time series is challenging primarily due to the large number of parameters involved. In this thesis, we propose two strategies for fitting large dimensional VAR models. The first strategy involves reducing the number of non-zero entries in the autoregressive (AR) coefficient matrices and the second is a method to reduce the effective dimension of the white noise covariance matrix. We propose a 2-stage approach for fitting large dimensional VAR models where many of the AR coefficients are zero. The first stage provides initial selection of non-zero AR coefficients by taking advantage of the properties of partial spectral coherence (PSC) in conjunction with BIC. The second stage, based on $t$-ratios and BIC, further refines the selection by removing spurious non-zero AR coefficients that remain after the first stage. Our simulation study suggests that the 2-stage approach outperforms Lasso-type methods in discovering sparsity patterns in AR coefficient matrices of VAR models. The performance of our 2-stage approach is also illustrated with three real data examples. Our second strategy for reducing the complexity of a large dimensional VAR model is based on a reduced-rank estimator for the white noise covariance matrix. We first derive the reduced-rank covariance estimator under the setting of independent observations and give the analytical form of its maximum likelihood estimate. Then we describe how to integrate the proposed reduced-rank estimator into the fitting of large dimensional VAR models, where we consider two scenarios that require different model fitting procedures. 
In the VAR modeling context, our reduced-rank covariance estimator not only provides interpretable descriptions of the dependence structure of VAR processes but also leads to improvement in model-fitting and forecasting over unrestricted covariance estimators. Two real data examples are presented to illustrate these fitting procedures.Statisticspz2146StatisticsDissertationsSome Models for Time Series of Counts
http://academiccommons.columbia.edu/catalog/ac:152149
Liu, Henghttp://hdl.handle.net/10022/AC:P:14561Wed, 29 Aug 2012 00:00:00 +0000This thesis focuses on developing nonlinear time series models and establishing relevant theory with a view towards applications in which the responses are integer valued. The discreteness of the observations, which is not appropriate with classical time series models, requires novel modeling strategies. The majority of the existing models for time series of counts assume that the observations follow a Poisson distribution conditional on an accompanying intensity process that drives the serial dynamics of the model. According to whether the evolution of the intensity process depends on the observations or solely on an external process, the models are classified into parameter-driven and observation-driven. Compared to the former one, an observation-driven model often allows for easier and more straightforward estimation of the model parameters. On the other hand, the stability properties of the process, such as the existence and uniqueness of a stationary and ergodic solution that are required for deriving asymptotic theory of the parameter estimates, can be quite complicated to establish, as compared to parameter-driven models. In this thesis, we first propose a broad class of observation-driven models that is based upon a one-parameter exponential family of distributions and incorporates nonlinear dynamics. The establishment of stability properties of these processes, which is at the heart of this thesis, is addressed by employing theory from iterated random functions and coupling techniques. Using this theory, we are also able to obtain the asymptotic behavior of maximum likelihood estimates of the parameters. Extensions of the base model in several directions are considered. Inspired by the idea of a self-excited threshold ARMA process, a threshold Poisson autoregression is proposed. 
It introduces a two-regime structure in the intensity process and essentially allows for modeling negatively correlated observations. E-chain, a non-standard Markov chain technique and Lyapunov's method are utilized to show the stationarity and a law of large numbers for this process. In addition, the model has been adapted to incorporate covariates, an important problem of practical and primary interest. The base model is also extended to consider the case of multivariate time series of counts. Given a suitable definition of a multivariate Poisson distribution, a multivariate Poisson autoregression process is described and its properties studied. Several simulation studies are presented to illustrate the inference theory. The proposed models are also applied to several real data sets, including the number of transactions of the Ericsson stock, the return times of Goldman Sachs Group stock prices, the number of road crashes in Schiphol, the frequencies of occurrences of gold particles, the incidences of polio in the US and the number of presentations of asthma in an Australian hospital. An array of graphical and quantitative diagnostic tools, which is specifically designed for the evaluation of goodness of fit for time series of counts models, is described and illustrated with these data sets.Statisticshl2494StatisticsDissertationsStatistical inference in two non-standard regression problems
http://academiccommons.columbia.edu/catalog/ac:151460
Seijo, Emilio Franciscohttp://hdl.handle.net/10022/AC:P:14317Wed, 08 Aug 2012 00:00:00 +0000This thesis analyzes two regression models whose least squares estimators have nonstandard asymptotics. It is divided into an introduction and two parts. The introduction motivates the study of nonstandard problems and presents an outline of the contents of the remaining chapters. In part I, the least squares estimator of a multivariate convex regression function is studied in great detail. The main contribution here is a proof of the consistency of the aforementioned estimator in a completely nonparametric setting. Model misspecification, local rates of convergence and multidimensional regression models mixing convexity and componentwise monotonicity constraints are also considered. Part II deals with change-point regression models and the issues that might arise when applying the bootstrap to these problems. The classical bootstrap is shown to be inconsistent on a simple change-point regression model, and an alternative (smoothed) bootstrap procedure is proposed and proved to be consistent. The superiority of the alternative method is also illustrated through a simulation study. In addition, a version of the continuous mapping theorem specially suited for change-point estimators is proved and used to derive the results concerning the bootstrap.Statistics, Applied mathematics, Mathematicsefs2113StatisticsDissertationsSparse selection in Cox models with functional predictors
http://academiccommons.columbia.edu/catalog/ac:147707
Zhang, Yuleihttp://hdl.handle.net/10022/AC:P:13445Thu, 07 Jun 2012 00:00:00 +0000This thesis investigates sparse selection in the Cox regression models with functional predictors. Interest in sparse selection with functional predictors (Lindquist and McKeague, 2009; McKeague and Sen, 2010) can arise in biomedical studies. A functional predictor is a predictor with a trajectory which is usually indexed by time, location or other factors. When the trajectory of a covariate is observed for each subject, and we need to identify a common "sensitive" point of these trajectories which drives the outcome, the problem can be formulated as sparse selection with functional predictors. For example, we may locate a gene that is associated with cancer risk along a chromosome. The functional linear regression method is widely used for the analysis of functional covariates. However, it can lack interpretability. The method we develop in this thesis has a straightforward interpretation since it relates the hazard to some sensitive components of functional covariates. The Cox regression model has been extensively studied in the analysis of time-to-event data. In this thesis, we extend it to allow for sparse selection with functional predictors. Using the partial likelihood as the criterion function, and following the 3-step procedure for M-estimators established in van der Vaart and Wellner (1996), the consistency, rate of convergence and asymptotic distribution are obtained for M-estimators of the sensitive point and the regression coefficients. To study these large sample properties of the estimators, a fractional Brownian motion assumption is imposed on the trajectories for mathematical tractability. Simulations are conducted to evaluate the finite sample performance of the methods, and a way to construct the confidence interval for the location parameter, i.e., the sensitive point, is proposed. 
The proposed method is applied to an adult brain cancer study and a breast cancer study to find the sensitive point, here the locus of a chromosome, which is closely related to cancer mortality. Since the breast cancer data set has missing values, we investigate the impact of varying proportions of missingness in the data on the accuracy of our estimator as well.Biostatistics, Statistics, Applied mathematicsyz2157BiostatisticsDissertationsThe Relation between Uncertainty in Latent Class Membership and Outcomes in a Latent Class Signal Detection Model
http://academiccommons.columbia.edu/catalog/ac:146637
Cheng, Zhifenhttp://hdl.handle.net/10022/AC:P:13139Fri, 04 May 2012 00:00:00 +0000Latent class variables are often used to predict outcomes. The conventional practice is to first assign observations to one of the latent classes based on the maximum posterior probabilities. The assigned class membership is then treated as an observed variable and used in predicting the outcomes. This widely used classify-analyze strategy ignores the uncertainty of being in a certain latent class for the observations. Once an observation is classified to the latent class with the highest posterior probability, its probability of being in the assigned class is treated as being one. In addition, once observations are classified to the latent class with the highest posterior probability, their representativeness of the class becomes the same because they will all have a probability of one of being in the assigned class. Finally, standard errors are underestimated because the residual uncertainty about the latent class membership is ignored. This dissertation used simulation studies and an analysis of a real-world data set to compare five commonly adopted approaches (most likely class regression, probability regression, probability-weighted regression, pseudo-class regression, and the simultaneous approach) for measuring the association between a latent class variable and outcome variables to see which one can better account for the uncertainty in latent class membership in such a situation. The model considered in the study was a latent class extension of the signal detection model (LC-SDT) by DeCarlo, which has proved to be able to address certain measurement issues in the educational field, more specifically, rater issues involved in essay grading such as rater effects and rater reliability. An LC-SDT model has the potential for wide applications in education as well as other areas. 
Therefore it is important to explore the issue of accounting for uncertainty in latent class membership within this framework. Three ordinal outcome variables having a negative, weak, and strong association with the latent class variable were considered in the simulations. Results of the simulations showed that the simultaneous approach performed best in obtaining unbiased parameter estimates. It also yielded larger standard errors than the other approaches which have been found by previous research to underestimate standard errors. Even though the simultaneous approach has its advantages, including outcome variables in a latent class model can affect parameters of the response variables. Therefore, cautions need to be taken when using this approach. The analysis results of the real-world data set confirmed the trends observed in the simulation studies.Quantitative psychology and psychometrics, Educational psychology, Statisticszc2133Human Development, Measurement and EvaluationDissertationsStatistics for Learning Genetics
http://academiccommons.columbia.edu/catalog/ac:146201
Charles, Abigail Sheenahttp://hdl.handle.net/10022/AC:P:13015Tue, 17 Apr 2012 00:00:00 +0000This study investigated the knowledge and skills that biology students may need to help them understand statistics/mathematics as it applies to genetics. The data are based on analyses of current representative genetics texts, practicing genetics professors' perspectives, and more directly, students' perceptions of, and performance in, doing statistically-based genetics problems. This issue is at the emerging edge of modern college-level genetics instruction, and this study attempts to identify key theoretical components for creating a specialized biological statistics curriculum. The goal of this curriculum will be to prepare biology students with the skills for assimilating quantitatively-based genetic processes, increasingly at the forefront of modern genetics. To fulfill this, two college level classes at two universities were surveyed. One university was located in the northeastern US and the other in the West Indies. There was a sample size of 42 students and a supplementary interview was administered to a select 9 students. Interviews were also administered to professors in the field in order to gain insight into the teaching of statistics in genetics. Key findings indicated that students had very little to no background in statistics (55%). Although students did perform well on exams with 60% of the population receiving an A or B grade, 77% of them did not offer good explanations on a probability question associated with the normal distribution provided in the survey. The scope and presentation of the applicable statistics/mathematics in some of the most used textbooks in genetics teaching, as well as genetics syllabi used by instructors, do not help the issue. It was found that the textbooks often either did not give effective explanations for students or completely left out certain topics. 
The omission of certain statistical/mathematical oriented topics was seen to be also true with the genetics syllabi reviewed for this study. Nonetheless, although the necessity for infusing these quantitative subjects with genetics and, overall, the biological sciences is growing (topics including synthetic biology, molecular systems biology and phylogenetics) there remains little time in the semester to be dedicated to the consolidation of learning and understanding.Mathematics education, Statistics, Geneticsasc2119Mathematics, Science, and Technology, Mathematics EducationDissertationsEditorial: Special Section on Statistical and Perceptual Audio Processing
http://academiccommons.columbia.edu/catalog/ac:144493
Ellis, Daniel P. W.; Raj, Bhiksha; Brown, Judith C.; Slaney, Malcolm; Smaragdis, Parishttp://hdl.handle.net/10022/AC:P:12565Wed, 15 Feb 2012 00:00:00 +0000Human perception has always been an inspiration for automatic processing systems, not least because tasks such as speech recognition only exist because people do them and, indeed, without that example we might wonder if they were possible at all. As computational power grows, we have increasing opportunities to model and duplicate perceptual abilities with greater fidelity, and, most importantly, based on larger and larger amounts of raw data describing both what signals exist in the real world, and how people respond to them. The power to deal with large data sets has meant that approaches that were once mere theoretical possibilities, such as exhaustive search of exponentially-sized codebooks, or real-time direct convolution of long sequences, have become increasingly practical and even unremarkable. A major consequence of this is the growth of statistical or corpus-based approaches, where complex relations, discriminations, or structures are inferred directly from example data (for instance by optimizing the parameters of a very general algorithm). An increasing number of complex tasks can be given empirically optimal solutions based on large, representative datasets. The traditional idea of perceptually-inspired processing is to develop a machine algorithm for a complex task such as melody recognition or source separation through inspiration and introspection about how individuals perform the task, and on the basis of direct psychological or neurophysiological data. The results can appear to be at odds with the statistical perspective, since perceptually-motivated work is often ad-hoc, comprising many stages whose individual contributions are difficult to separate. 
We believe that it is important to unify these two approaches: to employ rigorous, exhaustive techniques taking advantage of the statistics of large data sets to develop and solve perceptually-based and subjectively-defined problems. With this in mind, we organized a one-day workshop on Statistical and Perceptual Audio Processing as a satellite to the International Conference on Spoken Language Processing (ICSLP-INTERSPEECH), held in Jeju, Korea, in September 2004.Statistics, Physiological psychologyde171Electrical EngineeringArticlesState-Space Models and Latent Processes in the Statistical Analysis of Neural Data
http://academiccommons.columbia.edu/catalog/ac:142761
Vidne, Michaelhttp://hdl.handle.net/10022/AC:P:12050Tue, 20 Dec 2011 00:00:00 +0000This thesis develops and applies statistical methods for the analysis of neural data. In the second chapter we incorporate a latent process into the Generalized Linear Model framework. We develop and apply our framework to estimate the linear filters of an entire population of retinal ganglion cells while taking into account the effects of common-noise the cells might share. We are able to capture the encoding and decoding of visual stimulus to neural code. Our formalism gives us insight into the underlying architecture of the neural system, and we are able to estimate the common-noise that the cells receive. In the third chapter we discuss methods for optimally inferring the synaptic inputs to an electrotonically compact neuron, given intracellular voltage-clamp or current-clamp recordings from the postsynaptic cell. These methods are based on sequential Monte Carlo techniques ("particle filtering"). We demonstrate, on model data, that these methods can recover the time course of excitatory and inhibitory synaptic inputs accurately on a single trial. In the fourth chapter we develop a more general approach to the state-space filtering problem. Our method solves the same recursive set of Markovian filter equations as the particle filter, but we replace all importance sampling steps with a more general Markov chain Monte Carlo (MCMC) step. Our algorithm is especially well suited for problems where the model parameters might be misspecified.Applied mathematics, Statistics, Neurosciencesmv333Applied Physics and Applied MathematicsDissertationsMultiscale Representations for Manifold-Valued Data
http://academiccommons.columbia.edu/catalog/ac:140178
Rahman, Inam Ur; Drori, Iddo; Stodden, Victoria C.; Donoho, David L.; Schroeder, Peterhttp://hdl.handle.net/10022/AC:P:11434Tue, 11 Oct 2011 00:00:00 +0000We describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as the sphere S^2, the special orthogonal group SO(3), the positive definite matrices SPD(n), and the Grassmann manifolds G(n, k). The representations are based on the deployment of Deslauriers-Dubuc and Average Interpolating pyramids "in the tangent plane" of such manifolds, using the Exp and Log maps of those manifolds. The representations provide "wavelet coefficients" which can be thresholded, quantized, and scaled much as traditional wavelet coefficients. Tasks such as compression, noise removal, contrast enhancement, and stochastic simulation are facilitated by this representation. The approach applies to general manifolds, but is particularly suited to the manifolds we consider, i.e. Riemannian symmetric spaces, such as S^{n-1}, SO(n), G(n, k), where the Exp and Log maps are effectively computable. Applications to manifold-valued data sources of a geometric nature (motion, orientation, diffusion) seem particularly immediate. A software toolbox, SymmLab, can reproduce the results discussed in this paper.Statisticsvcs2115StatisticsArticlesBreakdown Point of Model Selection When the Number of Variables Exceeds the Number of Observations
http://academiccommons.columbia.edu/catalog/ac:140168
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11431Tue, 11 Oct 2011 00:00:00 +0000The classical multivariate linear regression problem assumes p variables X1, X2, ..., Xp and a response vector y, each with n observations, and a linear relationship between the two: y = X beta + z, where z ~ N(0, sigma^2). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p > n. We find that 1) the breakdown point is well-defined for random X-models and low noise, 2) increasing noise shifts the breakdown point to lower levels of sparsity, and reduces the model recovery ability of the algorithm in a systematic way, and 3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.Statisticsvcs2115StatisticsArticlesWhen Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
http://academiccommons.columbia.edu/catalog/ac:140175
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11433Tue, 11 Oct 2011 00:00:00 +0000We interpret non-negative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of "parts" and NMF correctly identifies the "parts". We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases.Statisticsvcs2115StatisticsArticlesFast l1 Minimization for Genomewide Analysis of mRNA Lengths
http://academiccommons.columbia.edu/catalog/ac:140172
Drori, Iddo; Stodden, Victoria C.; Hurowitz, Evan H.Tue, 11 Oct 2011 00:00:00 +0000Application of the virtual northern method to human mRNA allows us to systematically measure transcript length on a genome-wide scale [1]. Characterization of RNA transcripts by length provides a measurement which complements cDNA sequencing. We have robustly extracted the lengths of the transcripts expressed by each gene for comparison with the Unigene, Refseq, and H-Invitational databases [2, 3]. Obtaining an accurate probability for each peak requires performing multiple bootstrap simulations, each involving a deconvolution operation which is equivalent to finding the sparsest non-negative solution of an underdetermined system of equations. This process is computationally intensive for a large number of simulations and genes. In this contribution we present an efficient approximation method which is faster than general purpose solvers by two orders of magnitude, and in practice reduces our processing time from a week to hours.Genetics, Statisticsvcs2115StatisticsArticlesOn testing the change-point in the longitudinal bent line quantile regression model
http://academiccommons.columbia.edu/catalog/ac:139275
Sha, Nanshihttp://hdl.handle.net/10022/AC:P:11290Wed, 28 Sep 2011 00:00:00 +0000The problem of detecting changes has been receiving considerable attention in various fields. In general, the change-point problem is to identify the location(s) in an ordered sequence that divide this sequence into groups that follow different models. This dissertation considers the change-point problem in quantile regression for observational or clinical studies involving correlated data (e.g. longitudinal studies). Our research is motivated by the lack of ideal inference procedures for such models. Our contributions are two-fold. First, we extend the previously reported work on the bent line quantile regression model [Li et al. (2011)] to a longitudinal framework. Second, we propose a score-type test for hypothesis testing of the change-point problem using rank-based inference. The proposed test in this thesis has several advantages over the existing inferential approaches. Most importantly, it circumvents the difficulties of estimating nuisance parameters (e.g. density function of unspecified error) as required for the Wald test in previous works and thus is more reliable in finite sample performance. Furthermore, we demonstrate, through a series of simulations, that the proposed methods also outperform the extensively used bootstrap methods by providing more accurate and computationally efficient confidence intervals. To illustrate the usage of our methods, we apply them to two datasets from real studies: the Finnish Longitudinal Growth Study and an AIDS clinical trial. In each case, the proposed approach sheds light on the response pattern by providing an estimated location of abrupt change, a key parameter with clinical implications, along with its 95% confidence interval at any quantile of interest. The proposed methods allow for different change-points at different quantile levels of the outcome. 
In this way, they offer a more comprehensive picture of the covariate effects on the response variable than is provided by other change-point models targeted exclusively on the conditional mean. We conclude that our framework and proposed methodology are valuable for studying the change-point problem involving longitudinal data.Statisticsns2397BiostatisticsDissertationsSelf-controlled methods for postmarketing drug safety surveillance in large-scale longitudinal data
http://academiccommons.columbia.edu/catalog/ac:137551
Simpson, Shawn E.http://hdl.handle.net/10022/AC:P:10963Mon, 22 Aug 2011 00:00:00 +0000A primary objective in postmarketing drug safety surveillance is to ascertain the relationship between time-varying drug exposures and adverse events (AEs) related to health outcomes. Surveillance can be based on longitudinal observational databases (LODs), which contain time-stamped patient-level medical information including periods of drug exposure and dates of diagnoses. Due to its desirable properties, we focus on the self-controlled case series (SCCS) method for analysis in this context. SCCS implicitly controls for fixed multiplicative baseline covariates since each individual acts as their own control. In addition, only exposed cases are required for the analysis, which is computationally advantageous. In the first part of this work we present how the simple SCCS model can be applied to the surveillance problem, and compare the results of simple SCCS to those of existing methods. Many current surveillance methods are based on marginal associations between drug exposures and AEs. Such analyses ignore confounding drugs and interactions and have the potential to give misleading results. In order to avoid these difficulties, it is desirable for an analysis strategy to incorporate large numbers of time-varying potential confounders such as other drugs. In the second part of this work we propose the Bayesian multiple SCCS approach, which deals with high dimensionality and can provide a sparse solution via a Laplacian prior. We present details of the model and optimization procedure, as well as results of empirical investigations. SCCS is based on a conditional Poisson regression model, which assumes that events at different time points are conditionally independent given the covariate process. This requirement is problematic when the occurrence of an event can alter the future event risk. 
In a clinical setting, for example, patients who have a first myocardial infarction (MI) may be at higher subsequent risk for a second. In the third part of this work we propose the positive dependence self-controlled case series (PD-SCCS) method: a generalization of SCCS that allows the occurrence of an event to increase the future event risk, yet maintains the advantages of the original by controlling for fixed baseline covariates and relying solely on data from cases. We develop the model and compare the results of PD-SCCS and SCCS on example drug-AE pairs.Statisticsses2155StatisticsDissertationsRater Drift in Constructed Response Scoring via Latent Class Signal Detection Theory and Item Response Theory
http://academiccommons.columbia.edu/catalog/ac:132272
Park, Yoon Soohttp://hdl.handle.net/10022/AC:P:10394Tue, 17 May 2011 00:00:00 +0000The use of constructed response (CR) items or performance tasks to assess test takers' ability has grown tremendously over the past decade. Examples of CR items in psychological and educational measurement range from essays and works of art to admissions interviews. However, unlike multiple-choice (MC) items that have predetermined options, CR items require test takers to construct their own answer. As such, they require the judgment of multiple raters who are subject to differences in perception and prior knowledge of the material being evaluated. As with any scoring procedure, the scores assigned by raters must be comparable over time and over different test administrations and forms; in other words, scores must be reliable and valid for all test takers, regardless of when an individual takes the test. This study examines how longitudinal patterns or changes in rater behavior affect model-based classification accuracy. Rater drift refers to changes in rater behavior across different test administrations. Prior research has found evidence of drift. Rater behavior in CR scoring is examined using two measurement models - latent class signal detection theory (SDT) and item response theory (IRT) models. Rater effects (e.g., leniency and strictness) are partly examined with simulations, where the ability of different models to capture changes in rater behavior is studied. Drift is also examined in two real-world large scale tests: a teacher certification test and a high school writing test. These tests use the same set of raters for long periods of time, where each rater's scoring is examined on a monthly basis. Results from the empirical analysis showed that rater models were effective in detecting changes in rater behavior over testing administrations in real-world data. However, there were differences in rater discrimination between the latent class SDT and IRT models. 
Simulations were used to examine the effect of rater drift on classification accuracy and on differences between the latent class SDT and IRT models. Changes in rater severity had only a minimal effect on classification. Rater discrimination had a greater effect on classification accuracy. This study also found that IRT models detected changes in rater severity and in rater discrimination even when data were generated from the latent class SDT model. However, when data were non-normal, IRT models underestimated rater discrimination, which may lead to incorrect inferences on the precision of raters. These findings provide new and important insights into CR scoring and issues that emerge in practice, including methods to improve rater training.Quantitative psychology and psychometrics, Educational tests and measurements, Statisticsysp2102Human Development, National Center for Disaster Preparedness, Measurement and EvaluationDissertationsSome Nonparametric Methods for Clinical Trials and High Dimensional Data
http://academiccommons.columbia.edu/catalog/ac:174242
Wu, Xiaoruhttp://hdl.handle.net/10022/AC:P:10335Wed, 11 May 2011 00:00:00 +0000This dissertation addresses two problems from novel perspectives. In chapter 2, I propose an empirical likelihood based method to nonparametrically adjust for baseline covariates in randomized clinical trials and in chapter 3, I develop a survival analysis framework for multivariate K-sample problems. (I): Covariate adjustment is an important tool in the analysis of randomized clinical trials and observational studies. It can be used to increase efficiency and thus power, and to reduce possible bias. While most statistical tests in randomized clinical trials are nonparametric in nature, approaches for covariate adjustment typically rely on specific regression models, such as the linear model for a continuous outcome, the logistic regression model for a dichotomous outcome, and the Cox model for survival time. Several recent efforts have focused on model-free covariate adjustment. This thesis makes use of the empirical likelihood method and proposes a nonparametric approach to covariate adjustment. A major advantage of the new approach is that it automatically utilizes covariate information in an optimal way without fitting a nonparametric regression. The usual asymptotic properties, including the Wilks-type result of convergence to a chi-square distribution for the empirical likelihood ratio based test, and asymptotic normality for the corresponding maximum empirical likelihood estimator, are established. It is also shown that the resulting test is asymptotically most powerful and that the estimator for the treatment effect achieves the semiparametric efficiency bound. The new method is applied to the Global Use of Strategies to Open Occluded Coronary Arteries (GUSTO)-I trial. Extensive simulations are conducted, validating the theoretical findings. This work is not only useful for nonparametric covariate adjustment but also has theoretical value. 
It broadens the scope of the traditional empirical likelihood inference by allowing the number of constraints to grow with the sample size. (II): Motivated by applications in high-dimensional settings, I propose a novel approach to testing equality of two or more populations by constructing a class of intensity centered score processes. The resulting tests are analogous in spirit to the well-known class of weighted log-rank statistics that is widely used in survival analysis. The test statistics are nonparametric, computationally simple and applicable to high-dimensional data. We establish the usual large sample properties by showing that the underlying log-rank score process converges weakly to a Gaussian random field with zero mean under the null hypothesis, and with a drift under the contiguous alternatives. For the Kolmogorov-Smirnov-type and the Cramer-von Mises-type statistics, we also establish the consistency result for any fixed alternative. As a practical means to obtain approximate cutoff points for the test statistics, a simulation based resampling method is proposed, with theoretical justification given by establishing weak convergence for the randomly weighted log-rank score process. The new approach is applied to a study of brain activation measured by functional magnetic resonance imaging when performing two linguistic tasks and also to a prostate cancer DNA microarray data set.Statisticsxw2144StatisticsDissertationsContagion and Systemic Risk in Financial Networks
http://academiccommons.columbia.edu/catalog/ac:131474
Moussa, Amalhttp://hdl.handle.net/10022/AC:P:10249Fri, 29 Apr 2011 00:00:00 +0000The 2007-2009 financial crisis has shed light on the importance of contagion and systemic risk, and revealed the lack of adequate indicators for measuring and monitoring them. This dissertation addresses these issues and leads to several recommendations for the design of an improved assessment of systemic importance, improved rating methods for structured finance securities, and their use by investors and risk managers. Using a complete data set of all mutual exposures and capital levels of financial institutions in Brazil in 2007 and 2008, we explore in chapter 2 the structure and dynamics of the Brazilian financial system. We show that the Brazilian financial system exhibits a complex network structure characterized by a strong degree of heterogeneity in connectivity and exposure sizes across institutions, which is qualitatively and quantitatively similar to the statistical features observed in other financial systems. We find that the Brazilian financial network is well represented by a directed scale-free network, rather than a small world network. Based on these observations, we propose a stochastic model for the structure of banking networks, representing them as a directed weighted scale free network with power law distributions for in-degree and out-degree of nodes, Pareto distribution for exposures. This model may then be used for simulation studies of contagion and systemic risk in networks. We propose in chapter 3 a quantitative methodology for assessing contagion and systemic risk in a network of interlinked institutions. We introduce the Contagion Index as a metric of the systemic importance of a single institution or a set of institutions, that combines the effects of both common market shocks to portfolios and contagion through counterparty exposures. 
Using a directed scale-free graph simulation of the financial system, we study the sensitivity of contagion to a change in aggregate network parameters: connectivity, concentration of exposures, heterogeneity in degree distribution and network size. More concentrated and more heterogeneous networks are found to be more resilient to contagion. The impact of connectivity is more controversial: in well-capitalized networks, increasing connectivity improves the resilience to contagion when the initial level of connectivity is high, but increases contagion when the initial level of connectivity is low. In undercapitalized networks, increasing connectivity tends to increase the severity of contagion. We also study the sensitivity of contagion to local measures of connectivity and concentration across counterparties --the counterparty susceptibility and local network frailty-- that are found to have a monotonically increasing relationship with the systemic risk of an institution. Requiring a minimum (aggregate) capital ratio is shown to reduce the systemic impact of defaults of large institutions; we show that the same effect may be achieved with less capital by imposing such capital requirements only on systemically important institutions and those exposed to them. In chapter 4, we apply this methodology to the study of the Brazilian financial system. Using the Contagion Index, we study the potential for default contagion and systemic risk in the Brazilian system and analyze the contribution of balance sheet size and network structure to systemic risk. Our study reveals that, aside from balance sheet size, the network-based local measures of connectivity and concentration of exposures across counterparties introduced in chapter 3, the counterparty susceptibility and local network frailty, contribute significantly to the systemic importance of an institution in the Brazilian network. Thus, imposing an upper bound on these variables could help reduce contagion. 
We examine the impact of various capital requirements on the extent of contagion in the Brazilian financial system, and show that targeted capital requirements achieve the same reduction in systemic risk with lower requirements in capital for financial institutions. The methodology we proposed in chapter 3 for estimating contagion and systemic risk requires visibility on the entire network structure. Reconstructing bilateral exposures from balance sheet data is then a question of interest in a financial system where bilateral exposures are not disclosed. We propose in chapter 5 two methods to derive a distribution of bilateral exposures matrices. The first method attempts to recover the balance sheet assets and liabilities "sample by sample". Each sample of the bilateral exposures matrix is the solution of a relative entropy minimization problem subject to the balance sheet constraints. However, a solution to this problem does not always exist when dealing with sparse sample matrices. Thus, we propose a second method that attempts to recover the assets and liabilities "in the mean". This approach is the analogue of the Weighted Monte Carlo method introduced by Avellaneda et al. (2001). We first simulate independent samples of the bilateral exposures matrix from a relevant prior distribution on the network structure, then we compute posterior probabilities by maximizing the entropy under the constraints that the balance sheet assets and liabilities are recovered in the mean. We discuss the pros and cons of each approach and explain how it could be used to detect systemically important institutions in the financial system. The recent crisis has also raised many questions regarding the meaning of structured finance credit ratings issued by rating agencies and the methodology behind them. 
Chapter 6 aims at clarifying some misconceptions related to structured finance ratings and how they are commonly interpreted: we discuss the comparability of structured finance ratings with bond ratings, the interaction between the rating procedure and the tranching procedure and its consequences for the stability of structured finance ratings in time. These insights are illustrated in a factor model by simulating rating transitions for CDO tranches using a nested Monte Carlo method. In particular, we show that the downgrade risk of a CDO tranche can be quite different from that of a bond with the same initial rating. Structured finance ratings follow path-dependent dynamics that cannot be adequately described, as usually done, by a matrix of transition probabilities. Therefore, a simple labeling via default probability or expected loss does not sufficiently discriminate their downgrade risk. We propose to supplement ratings with indicators of downgrade risk. To overcome some of the drawbacks of existing rating methods, we suggest a risk-based rating procedure for structured products. Finally, we formulate a series of recommendations regarding the use of credit ratings for CDOs and other structured credit instruments.Finance, Statisticsam2810Industrial Engineering and Operations Research, StatisticsDissertationsStatistical methods for indirectly observed network data
http://academiccommons.columbia.edu/catalog/ac:131447
McCormick, Tyler H.http://hdl.handle.net/10022/AC:P:10239Fri, 29 Apr 2011 00:00:00 +0000Social networks have become an increasingly common framework for understanding and explaining social phenomena. Yet, despite an abundance of sophisticated models, social network research has yet to realize its full potential, in part because of the difficulty of collecting social network data. In many cases, particularly in the social sciences, collecting complete network data is logistically and financially challenging. In contrast, Aggregated Relational Data (ARD) measure network structure indirectly by asking respondents how many connections they have with members of a certain subpopulation (e.g. How many individuals with HIV/AIDS do you know?). These data require no special sampling procedure and are easily incorporated into existing surveys. This research develops a latent space model for ARD. This dissertation proposes statistical methods for estimating social network and population characteristics using one type of social network data collected using standard surveys. First, a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population is proposed. A second method estimates the demographic characteristics of hard-to-reach groups, or latent demographic profiles. These groups, such as those with HIV/AIDS, unlawful immigrants, or the homeless, are often excluded from the sampling frame of standard social science surveys. A third method develops a latent space model for ARD. This method is similar in spirit to previous latent space models for networks (see Hoff, Raftery and Handcock (2002), for example) in that the dependence structure of the network is represented parsimoniously in a multidimensional geometric space. 
The key distinction from the complete network case is that instead of conditioning on the (latent) distance between two members of the network, the latent space model for ARD conditions on the expected distance between a survey respondent and the center of a subpopulation in the latent space. A spherical latent space facilitates tractable computation of this expectation. This model estimates relative homogeneity between groups in the population and variation in the propensity for interaction between respondents and group members.Statisticsthm2105StatisticsDissertationsDynamic Targeted Pricing in B2B Settings
http://academiccommons.columbia.edu/catalog/ac:130786
Zhang, Zaozaohttp://hdl.handle.net/10022/AC:P:10178Wed, 13 Apr 2011 00:00:00 +0000This research models the impact of firm pricing decisions on different facets of the customer purchasing process in business-to-business (B2B) contexts and develops an integrated framework for inter-temporal targeted pricing to optimize long-term profitability for the firm. Pricing decisions in B2B settings are inherently different from those within the business-to-consumer (B2C) environment, commonly studied in marketing. First, B2B pricing often offers considerable flexibility in implementing first degree and inter-temporal price discrimination, i.e., sellers in B2B contexts can easily vary prices across customers and can even change prices between subsequent purchases by the same customer. While this flexibility affords significant opportunities for the firm, it also requires great care in ensuring long-term profitability. Second, transactions in the B2B environment are often more complex than those in B2C settings. Specifically, the business customer typically makes several interrelated decisions (e.g., when and how much to buy, whether to ask for a quote and whether to accept the seller's bid), which need to be modeled jointly. The proposed model considers these inter-related decisions in an integrated fashion. In addition, the model accounts for heterogeneity in customers' preferences and behaviors, asymmetric reference price effects, and purchase dynamics, while taking into account the short- and long-term implications of the pricing policy. To model the complexity of inter-related joint customer decisions, we use hierarchical Bayesian copulas, which weave together different marginal distributions to form joint distributions. 
To account for dynamics in purchase behavior and to model the possible long-term impact of experienced prices on the different components of the customer's decision, we use a non-homogeneous hidden Markov model with multivariate interrelated state-dependent behaviors. In addition, we rely on the behavioral pricing literature in modeling the effect of price, using asymmetric reference price effects. We calibrated the model using longitudinal transaction data from a metals retailer. The results reveal several substantive insights about the short- and long-term impact of the firm's pricing decisions on each of the inter-related components of the customer's purchasing behavior. Specifically, we find positive interdependence between the quantity and purchase timing decisions and strong negative interdependence between the decision to request a quote and the decision to accept it. Capturing the long-term and asymmetric impact of reference prices, we find that losses not only have larger negative effects relative to gains on customers' buying behavior, but customers also take longer to adapt to losses than they do to gains. Furthermore, the firm's pricing decisions could have a long-term impact on its customers by shifting their preferences between a "vigilant" state, characterized by a cautious approach towards ordering and heightened price sensitivity, and a more "relaxed" state. These dynamics imply that the B2B seller needs to carefully consider both the short- and the long-term consequences of its pricing policy when setting prices for each order. Additionally, the proposed model exhibits superior predictive performance relative to several benchmark models, and in a price policy simulation results in a 52% improvement in profitability compared to the company's current practice. 
Other policy simulations are conducted to examine how the B2B firm should price in the recent economic environment with volatile commodity prices. We find that when commodity prices increase, the firm should pass the costs on to the customers; when the prices decrease, the firm should "hoard" the profit.Marketing, Statistics, Economicszz2122BusinessDissertationsWhy we (usually) don't have to worry about multiple comparison
http://academiccommons.columbia.edu/catalog/ac:129500
Gelman, Andrew E.; Hill, Jennifer; Yajima, Masanaohttp://hdl.handle.net/10022/AC:P:9795Wed, 12 Jan 2011 00:00:00 +0000Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise. Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p-values corresponding to intervals of fixed width). Thus, multilevel models address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.Statisticsag389Political Science, Statistics, Columbia Population Research CenterWorking papersReducing Bias in Treatment Effect Estimation in Observational Studies Suffering from Missing Data
http://academiccommons.columbia.edu/catalog/ac:129151
Hill, Jenniferhttp://hdl.handle.net/10022/AC:P:9697Wed, 18 Aug 2010 00:00:00 +0000Matching based on estimated propensity scores (that is, the estimated conditional probability of being treated) has become an increasingly popular technique for causal inference over the past decade. By balancing observed covariates, propensity score methods reduce the risk of confounding causal processes. Estimation of propensity scores in the complete data case is generally straightforward since it uses standard methods (e.g. logistic regression or discriminant analysis) and relies on diagnostics that are relatively easy to calculate and interpret. Most studies, however, have missing data. This paper illustrates a principled approach to handling missing data when estimating propensity scores, one that makes use of multiple imputation (MI). Placing the problem within the framework of the Rubin Causal Model makes the assumptions explicit by illustrating the interaction between the treatment assignment mechanism and the missing data mechanism. Several approaches for estimating propensity scores with incomplete data using MI are presented. Results demonstrating improved efficacy compared with existing methodology are discussed. These advantages include greater bias reduction and increased facility in model choice.Social research, StatisticsInstitute for Social and Economic Research and PolicyWorking papersBayesian hierarchical classes analysis
http://academiccommons.columbia.edu/catalog/ac:125300
Leenen, Iwin; Mechelen, Iven van; Gelman, Andrew E.; Knop, Stijn dehttp://hdl.handle.net/10022/AC:P:8569Wed, 17 Mar 2010 00:00:00 +0000Hierarchical classes models are models for N-way N-mode data that represent the association among the N modes and simultaneously yield, for each mode, a hierarchical classification of its elements. In this paper we present a stochastic extension of the hierarchical classes model for two-way two-mode binary data. In line with the original model, the new probabilistic extension still represents both the association among the two modes and the hierarchical classifications. A fully Bayesian method for fitting the new model is presented and evaluated in a simulation study. Furthermore, we propose tools for model selection and model checking based on Bayes factors and posterior predictive checks. We illustrate the advantages of the new approach with applications in the domain of the psychology of choice and psychiatric diagnosis.Statisticsag389Political Science, StatisticsArticlesBayes: Radical, liberal, or conservative?
http://academiccommons.columbia.edu/catalog/ac:125306
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8571Wed, 17 Mar 2010 00:00:00 +0000Statisticsag389Political Science, StatisticsArticlesRich state, poor state, red state, blue state: What's the matter with Connecticut?
http://academiccommons.columbia.edu/catalog/ac:125297
Gelman, Andrew E.; Shor, Boris; Bafumi, Joseph; Park, David K.http://hdl.handle.net/10022/AC:P:8568Wed, 17 Mar 2010 00:00:00 +0000For decades, the Democrats have been viewed as the party of the poor, with the Republicans representing the rich. Recent presidential elections, however, have shown a reverse pattern, with Democrats performing well in the richer blue states in the northeast and coasts, and Republicans dominating in the red states in the middle of the country and the south. Through multilevel modeling of individual-level survey data and county- and state-level demographic and electoral data, we reconcile these patterns. Furthermore, we find that income matters more in red America than in blue America. In poor states, rich people are much more likely than poor people to vote for the Republican presidential candidate, but in rich states (such as Connecticut), income has a very low correlation with vote preference.Political science, Statisticsag389Political Science, StatisticsArticlesStruggles with survey weighting and regression modeling
http://academiccommons.columbia.edu/catalog/ac:125309
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8572Wed, 17 Mar 2010 00:00:00 +0000The general principles of Bayesian data analysis imply that models for survey responses should be constructed conditional on all variables that affect the probability of inclusion and nonresponse, which are also the variables used in survey weighting and clustering. However, such models can quickly become very complicated, with potentially thousands of poststratification cells. It is then a challenge to develop general families of multilevel probability models that yield reasonable Bayesian inferences. We discuss these issues in the context of several ongoing public health and social surveys. This work is currently open-ended, and we conclude with thoughts on how research could proceed to solve these problems.Statisticsag389Political Science, StatisticsArticlesRejoinder: Struggles with survey weighting and regression modeling
http://academiccommons.columbia.edu/catalog/ac:125312
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8573Wed, 17 Mar 2010 00:00:00 +0000I was motivated to write this paper, with its controversial opening line, "Survey weighting is a mess," by various experiences as an applied statistician.Statisticsag389Political Science, StatisticsArticlesComment: Bayesian Checking of the Second Levels of Hierarchical Models
http://academiccommons.columbia.edu/catalog/ac:125303
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8570Wed, 17 Mar 2010 00:00:00 +0000Bayarri and Castellanos (BC) have written an interesting paper discussing two forms of posterior model check, one based on cross-validation and one based on replication of new groups in a hierarchical model. We think both these checks are good ideas and can become even more effective when understood in the context of posterior predictive checking. For the purpose of discussion, however, it is most interesting to focus on the areas where we disagree with BC.Statisticsag389Political Science, StatisticsArticlesPartisans without constraint: Political polarization and trends in American public opinion
http://academiccommons.columbia.edu/catalog/ac:125291
Baldassarri, Delia; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8566Mon, 15 Mar 2010 00:00:00 +0000Public opinion polarization is here conceived as a process of alignment along multiple lines of potential disagreement and measured as growing constraint in individuals' preferences. Using NES data from 1972 to 2004, the authors model trends in issue partisanship--the correlation of issue attitudes with party identification--and issue alignment--the correlation between pairs of issues--and find a substantive increase in issue partisanship, but little evidence of issue alignment. The findings suggest that opinion changes correspond more to a resorting of party labels among voters than to greater constraint on issue attitudes: since parties are more polarized, they are now better at sorting individuals along ideological lines. Levels of constraint vary across population subgroups: strong partisans and wealthier and politically sophisticated voters have grown more coherent in their beliefs. The authors discuss the consequences of partisan realignment and group sorting on the political process and potential deviations from the classic pluralistic account of American politics.Political science, Statisticsag389Political Science, StatisticsArticlesPredicting and dissecting the seats-votes curve in the 2006 U.S. House election
http://academiccommons.columbia.edu/catalog/ac:125294
Kastellec, Jonathan P.; Gelman, Andrew E.; Chandler, Jamie P.http://hdl.handle.net/10022/AC:P:8567Mon, 15 Mar 2010 00:00:00 +0000The 2008 U.S. House elections mark the first time since 1994 that the Democrats will seek to retain a majority. With the political climate favoring Democrats this year, it seems almost certain that the party will retain control, and will likely increase its share of seats. In five national polls taken in June of this year, Democrats enjoyed on average a 13-point advantage in the generic congressional ballot; as Bafumi, Erikson, and Wlezien (2007) point out, these early polls, suitably adjusted, are good predictors of the November vote. As of late July, bettors at intrade.com put the probability of the Democrats retaining a majority at about 95% (Intrade.com 2008). Elsewhere in this symposium, Klarner (2008) predicts an 11-seat gain for the Democrats, while Lockerbie (2008) forecasts a 25-seat pickup. In this paper we document how the electoral playing field has shifted from a Republican advantage between 1996 and 2004 to a Democratic tilt today. In an earlier article (Kastellec, Gelman, and Chandler 2008), we predicted the seats-votes curve in the 2006 election, showing how the Democrats faced an uphill battle in their effort to take control of the House and, their victory notwithstanding, ended up winning a lower percentage of seats than their average district vote nationwide. We follow up on this analysis by using the same method to predict the seats-votes curve in 2008. Due to the shift in incumbency advantage from the Republicans to the Democrats, compounded by a greater number of retirements among Republican members, we show that the Democrats now enjoy a partisan bias, and can expect to win more seats than votes for the first time since 1992. 
While this bias is not as large as the advantage the Republicans held in 2006, it will likely help the Democrats increase their share of seats.Statistics, Political sciencejpk2004, ag389Political Science, StatisticsArticlesDiscussion of the Article "Website Morphing"
http://academiccommons.columbia.edu/catalog/ac:125288
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8565Mon, 15 Mar 2010 00:00:00 +0000The article under discussion illustrates the trade-off between optimization and exploration that is fundamental to statistical experimental design. In this discussion, I suggest that the research under discussion could be made even more effective by checking the fit of the model by comparing observed data to replicated data sets simulated from the fitted model.Statisticsag389Political Science, StatisticsArticlesThe playing field shifts: Predicting the seats-votes curve in the 2008 U.S. House election
http://academiccommons.columbia.edu/catalog/ac:125285
Kastellec, Jonathan P.; Gelman, Andrew E.; Chandler, Jamie P.http://hdl.handle.net/10022/AC:P:8564Mon, 15 Mar 2010 00:00:00 +0000The 2008 U.S. House elections mark the first time since 1994 that the Democrats will seek to retain a majority. With the political climate favoring Democrats this year, it seems almost certain that the party will retain control, and will likely increase its share of seats. In five national polls taken in June of this year, Democrats enjoyed on average a 13-point advantage in the generic congressional ballot; as Bafumi, Erikson, and Wlezien (2007) point out, these early polls, suitably adjusted, are good predictors of the November vote. As of late July, bettors at intrade.com put the probability of the Democrats retaining a majority at about 95% (Intrade.com 2008). Elsewhere in this symposium, Klarner (2008) predicts an 11-seat gain for the Democrats, while Lockerbie (2008) forecasts a 25-seat pickup. In this paper we document how the electoral playing field has shifted from a Republican advantage between 1996 and 2004 to a Democratic tilt today. In an earlier article (Kastellec, Gelman, and Chandler 2008), we predicted the seats-votes curve in the 2006 election, showing how the Democrats faced an uphill battle in their effort to take control of the House and, their victory notwithstanding, ended up winning a lower percentage of seats than their average district vote nationwide. We follow up on this analysis by using the same method to predict the seats-votes curve in 2008. Due to the shift in incumbency advantage from the Republicans to the Democrats, compounded by a greater number of retirements among Republican members, we show that the Democrats now enjoy a partisan bias, and can expect to win more seats than votes for the first time since 1992. 
While this bias is not as large as the advantage the Republicans held in 2006, it will likely help the Democrats increase their share of seats.Political science, Statisticsjpk2004, ag389Political Science, StatisticsArticles