Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Statistics&f%5Blanguage%5D%5B%5D=English&f%5Bsubject_facet%5D%5B%5D=Statistics&q=&rows=500&sort=record_creation_date+desc
Academic Commons Search Resultsen-usOn Model-Selection and Applications of Multilevel Models in Survey and Causal Inference
http://academiccommons.columbia.edu/catalog/ac:200369
Wang, Weihttp://dx.doi.org/10.7916/D8571C4QWed, 22 Jun 2016 12:34:52 +0000This thesis includes three parts. The overarching theme is how to analyze multilevel structured datasets, particularly in the areas of survey and causal inference. The first part discusses model selection of hierarchical models, in the context of a national political survey. I found that the commonly used model selection criteria based on predictive accuracy, such as cross validation, don't perform very well in the case of political survey and explore the possible causes. The second part centers around a unique data set on the presidential election collected through an online platform. I show that with adequate modeling, meaningful and highly accurate information could be extracted from this highly-biased data set. The third part builds on a formal causal inference framework for group-structured data, such as meta-analysis and multi-site trials. In particular, I develop a Gaussian Process model under this framework and demonstrate additional insights that can be gained compared with traditional parametric models.Statisticsww2243StatisticsDissertationsAsymptotic Theory and Applications of Random Functions
http://academiccommons.columbia.edu/catalog/ac:198322
Li, Xiaoouhttp://dx.doi.org/10.7916/D8QF8SW7Tue, 03 May 2016 09:21:26 +0000Random functions is the central component in many statistical and probabilistic problems. This dissertation presents theoretical analysis and computation for random functions and its applications in statistics. This dissertation consists of two parts. The first part is on the topic of classic continuous random fields. We present asymptotic analysis and computation for three non-linear functionals of random fields. In Chapter 1, we propose an efficient Monte Carlo algorithm for computing P{sup_T f(t)>b} when b is large, and f is a Gaussian random field living on a compact subset T. For each pre-specified relative error ɛ, the proposed algorithm runs in a constant time for an arbitrarily large $b$ and computes the probability with the relative error ɛ. In Chapter 2, we present the asymptotic analysis for the tail probability of ∫_T e^{σf(t)+μ(t)}dt under the asymptotic regime that σ tends to zero. In Chapter 3, we consider partial differential equations (PDE) with random coefficients, and we develop an unbiased Monte Carlo estimator with finite variance for computing expectations of the solution to random PDEs. Moreover, the expected computational cost of generating one such estimator is finite. In this analysis, we employ a quadratic approximation to solve random PDEs and perform precise error analysis of this numerical solver. The second part of this dissertation focuses on topics in statistics. The random functions of interest are likelihood functions, whose maximum plays a key role in statistical inference. We present asymptotic analysis for likelihood based hypothesis tests and sequential analysis. In Chapter 4, we derive an analytical form for the exponential decay rate of error probabilities of the generalized likelihood ratio test for testing two general families of hypotheses. In Chapter 5, we study asymptotic properties of the generalized sequential probability ratio test, the stopping rule of which is the first boundary crossing time of the generalized likelihood ratio statistic. We show that this sequential test is asymptotically optimal in the sense that it achieves asymptotically the shortest expected sample size as the maximal type I and type II error probabilities tend to zero. These results have important theoretical implications in hypothesis testing, model selection, and other areas where maximum likelihood is employed.Statisticsxl2306StatisticsDissertationsSpectral Filtering for Spatio-temporal Dynamics and Multivariate Forecasts
http://academiccommons.columbia.edu/catalog/ac:198310
Meng, Luhttp://dx.doi.org/10.7916/D80Z7385Tue, 03 May 2016 09:20:21 +0000Due to the increasing availability of massive spatio-temporal data sets, modeling high dimensional data becomes quite challenging. A large number of research questions are rooted in identifying underlying dynamics in such spatio-temporal data. For many applications, the science suggests that the intrinsic dynamics be smooth and of low dimension. To reduce the variance of estimates and increase the computational tractability, dimension reduction is also quite necessary in the modeling procedure. In this dissertation, we propose a spectral filtering approach for dimension reduction and forecast amelioration, and apply it to multiple applications. We show the effectiveness of dimension reduction via our method and also illustrate its power for prediction in both simulation and real data examples. The resultant lower dimensional principal component series has a diagonal spectral density at each frequency whose diagonal elements are in descending order, which is not well motivated can be hard to interpret. Therefore we propose a phase-based filtering method to create principal component series with interpretable dynamics in the time domain. Our method is based on an approach of structural decomposition and phase-aligned construction in the frequency domain, identifying lower-rank dynamics and its components embedded in a high dimensional spatio-temporal system. In both our simulated examples and real data applications, we illustrate that the proposed method is able to separate and identify meaningful lower-rank movements. Benefiting from the zero-coherence property of the principal component series, we subsequently develop a predictive model for high-dimensional forecasting via lower-rank dynamics. Our modeling approach reduces multivariate modeling task to multiple univariate modeling and is flexible in combining with regularization techniques to obtain more stable estimates and improve interpretability. The simulation results and real data analysis show that our model achieves superior forecast performance compared to the class of autoregressive models.Statisticslm2844StatisticsDissertationsLatent Variable Modeling and Statistical Learning
http://academiccommons.columbia.edu/catalog/ac:198122
Chen, Yunxiaohttp://dx.doi.org/10.7916/D8PV6KBNFri, 29 Apr 2016 21:15:12 +0000Latent variable models play an important role in psychological and educational measurement, which attempt to uncover the underlying structure of responses to test items. This thesis focuses on the development of statistical learning methods based on latent variable models, with applications to psychological and educational assessments. In that connection, the following problems are considered.
The first problem arises from a key assumption in latent variable modeling, namely the local independence assumption, which states that given an individual's latent variable (vector), his/her responses to items are independent. This assumption is likely violated in practice, as many other factors, such as the item wording and question order, may exert additional influence on the item responses. Any exploratory analysis that relies on this assumption may result in choosing too many nuisance latent factors that can neither be stably estimated nor reasonably interpreted. To address this issue, a family of models is proposed that relax the local independence assumption by combining the latent factor modeling and graphical modeling. Under this framework, the latent variables capture the across-the-board dependence among the item responses, while a second graphical structure characterizes the local dependence. In addition, the number of latent factors and the sparse graphical structure are both unknown and learned from data, based on a statistically solid and computationally efficient method.
The second problem is to learn the relationship between items and latent variables, a structure that is central to multidimensional measurement. In psychological and educational assessments, this relationship is typically specified by experts when items are written and is incorporated into the model without further verification after data collection. Such a non-empirical approach may lead to model misspecification and substantial lack of model fit, resulting in erroneous interpretation of assessment results. Motivated by this, I consider to learn the item - latent variable relationship based on data. It is formulated as a latent variable selection problem, for which theoretical analysis and a computationally efficient algorithm are provided.Statisticsyc2710StatisticsDissertationsAdvances in Model Selection Techniques with Applications to Statistical Network Analysis and Recommender Systems
http://academiccommons.columbia.edu/catalog/ac:198116
Franco Saldana, Diegohttp://dx.doi.org/10.7916/D8GB2424Fri, 29 Apr 2016 21:14:50 +0000This dissertation focuses on developing novel model selection techniques, the process by which a statistician selects one of a number of competing models of varying dimensions, under an array of different statistical assumptions on observed data. Traditionally, two main reasons have been advocated by researchers for performing model selection strategies over classical maximum likelihood estimates (MLEs). The first reason is prediction accuracy, where by shrinking or setting to zero some model parameters, one sacrifices the unbiasedness of MLEs for a reduced variance, which in turn leads to an overall improvement in predictive performance. The second reason relates to interpretability of the selected models in the presence of a large number of predictors, where in order to obtain a parsimonious representation exhibiting the relationship between the response and covariates, we are willing to sacrifice some of the smaller details brought in by spurious predictors.
In the first part of this work, we revisit the family of variable selection techniques known as sure independence screening procedures for generalized linear models and the Cox proportional hazards model. After clever combination of some of its most powerful variants, we propose new extensions based on the idea of sample splitting, data-driven thresholding, and combinations thereof. A publicly available package developed in the R statistical software demonstrates considerable improvements in terms of model selection and competitive computational time between our enhanced variable selection procedures and traditional penalized likelihood methods applied directly to the full set of covariates.
Next, we develop model selection techniques within the framework of statistical network analysis for two frequent problems arising in the context of stochastic blockmodels: community number selection and change-point detection. In the second part of this work, we propose a composite likelihood based approach for selecting the number of communities in stochastic blockmodels and its variants, with robustness consideration against possible misspecifications in the underlying conditional independence assumptions of the stochastic blockmodel. Several simulation studies, as well as two real data examples, demonstrate the superiority of our composite likelihood approach when compared to the traditional Bayesian Information Criterion or variational Bayes solutions. In the third part of this thesis, we extend our analysis on static network data to the case of dynamic stochastic blockmodels, where our model selection task is the segmentation of a time-varying network into temporal and spatial components by means of a change-point detection hypothesis testing problem. We propose a corresponding test statistic based on the idea of data aggregation across the different temporal layers through kernel-weighted adjacency matrices computed before and after each candidate change-point, and illustrate our approach on synthetic data and the Enron email corpus.
The matrix completion problem consists in the recovery of a low-rank data matrix based on a small sampling of its entries. In the final part of this dissertation, we extend prior work on nuclear norm regularization methods for matrix completion by incorporating a continuum of penalty functions between the convex nuclear norm and nonconvex rank functions. We propose an algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm-starts, and present a systematic study of the resulting spectral thresholding operators. We demonstrate that our proposed nonconvex regularization framework leads to improved model selection properties in terms of finding low-rank solutions with better predictive performance on a wide range of synthetic data and the famous Netflix data recommender system.Statisticsdf2406StatisticsDissertationsNew perspectives on learning, inference, and control in brains and machines
http://academiccommons.columbia.edu/catalog/ac:196425
Merel, Joshua Scotthttp://dx.doi.org/10.7916/D8C8296CWed, 16 Mar 2016 18:35:32 +0000The work presented in this thesis provides new perspectives and approaches for problems that arise in the analysis of neural data. Particular emphasis is placed on parameter fitting and automated analysis problems that would arise naturally in closed-loop experiments. Part one focuses on two brain-computer interface problems. First, we provide a framework for understanding co-adaptation, the setting in which decoder updating and user learning occur simultaneously. We also provide a new perspective on intention-based parameter fitting and tools to extend this approach to higher dimensional decoders. Part two focuses on event inference, which refers to the decomposition of observed timeseries data into interpretable events. We present application of event inference methods on voltage-clamp recordings as well as calcium imaging, and describe extensions to allow for combining data across modalities or trials.Neurosciences, Statisticsjsm2183Neurobiology and Behavior, StatisticsDissertationsMethods for Personalized and Evidence Based Medicine
http://academiccommons.columbia.edu/catalog/ac:195007
Shahn, Zachhttp://dx.doi.org/10.7916/D8M0458SWed, 24 Feb 2016 21:14:26 +0000There is broad agreement that medicine ought to be `evidence based' and `personalized' and that data should play a large role in achieving both these goals. But the path from data to improved medical decision making is not clear. This thesis presents three methods that hopefully help in small ways to clear the path.
Personalized medicine depends almost entirely on understanding variation in treatment effect. Chapter 1 describes latent class mixture models for treatment effect heterogeneity that distinguish between continuous and discrete heterogeneity, use hierarchical shrinkage priors to mitigate overfitting and multiple comparisons concerns, and employ flexible error distributions to improve robustness. We apply different versions of these models to reanalyze a clinical trial comparing HIV treatments and a natural experiment on the effect of Medicaid on emergency department utilization.
Medical decisions often depend on observational studies performed on large longitudinal health insurance claims databases. These studies usually claim to identify a causal effect, but empirical evaluations have demonstrated that standard methods for causal discovery perform poorly in this context, most likely in large part due to the presence of unobserved confounding. Chapter 2 proposes an algorithm called Ensembles of Granger Graphs (EGG) that does not rely on the assumption that unobserved confounding is absent. In a simulation and experiments on a real claims database, EGG is robust to confounding, has high positive predictive value, and has high power to detect strong causal effects.
While decision making inherently involves causal inference, purely predictive models aid many medical decisions in practice. Predictions from health histories are challenging because the space of possible predictors is so vast. Not only are there thousands of health events to consider, but also their temporal interactions. In Chapter 3, we adapt a method originally developed for speech recognition that greedily constructs informative labeled graphs representing temporal relations between multiple health events at the nodes of randomized decision trees. We use this method to predict strokes in patients with atrial fibrillation using data from a Medicaid claims database.
I hope the ideas illustrated in these three projects inspire work that someday genuinely improves healthcare. I also include a short `bonus' chapter on an improved estimate of effective sample size in importance sampling. This chapter is not directly related to medicine, but finds a home in this thesis nonetheless.Statisticszss2101StatisticsDissertationsDistributed Bayesian Computation and Self-Organized Learning in Sheets of Spiking Neurons with Local Lateral Inhibition
http://academiccommons.columbia.edu/catalog/ac:192253
Bill, Johannes; Buesing, Lars; Habenschuss, Stefan; Nessler, Bernhard; Maass, Wolfgang; Legenstein, Roberthttp://dx.doi.org/10.7916/D8862G4XMon, 14 Dec 2015 10:04:07 +0000During the last decade, Bayesian probability theory has emerged as a framework in cognitive science and neuroscience for describing perception, reasoning and learning of mammals. However, our understanding of how probabilistic computations could be organized in the brain, and how the observed connectivity structure of cortical microcircuits supports these calculations, is rudimentary at best. In this study, we investigate statistical inference and self-organized learning in a spatially extended spiking network model, that accommodates both local competitive and large-scale associative aspects of neural information processing, under a unified Bayesian account. Specifically, we show how the spiking dynamics of a recurrent network with lateral excitation and local inhibition in response to distributed spiking input, can be understood as sampling from a variational posterior distribution of a well-defined implicit probabilistic model. This interpretation further permits a rigorous analytical treatment of experience-dependent plasticity on the network level. Using machine learning theory, we derive update rules for neuron and synapse parameters which equate with Hebbian synaptic and homeostatic intrinsic plasticity rules in a neural implementation. In computer simulations, we demonstrate that the interplay of these plasticity rules leads to the emergence of probabilistic local experts that form distributed assemblies of similarly tuned cells communicating through lateral excitatory connections. The resulting sparse distributed spike code of a well-adapted network carries compressed information on salient input features combined with prior experience on correlations among them. Our theory predicts that the emergence of such efficient representations benefits from network architectures in which the range of local inhibition matches the spatial extent of pyramidal cells that share common afferent input.Neurosciences, Molecular biology, StatisticsStatisticsArticlesAn Assortment of Unsupervised and Supervised Applications to Large Data
http://academiccommons.columbia.edu/catalog/ac:189937
Agne, Michael Roberthttp://dx.doi.org/10.7916/D828073NThu, 15 Oct 2015 18:08:52 +0000This dissertation presents several methods that can be applied to large datasets with an enormous number of covariates. It is divided into two parts. In the first part of the dissertation, a novel approach to pinpointing sets of related variables is introduced. In the second part, several new methods and modifications of current methods designed to improve prediction are outlined. These methods can be considered extensions of the very successful I Score suggested by Lo and Zheng in a 2002 paper and refined in many papers since.
In Part I, unsupervised data (with no response) is addressed. In chapter 2, the novel unsupervised I score and its associated procedure are introduced and some of its unique theoretical properties are explored. In chapter 3, several simulations consisting of generally hard-to-wrangle scenarios demonstrate promising behavior of the approach. The method is applied to the complex field of market basket analysis, with a specific grocery data set used to show it in action in chapter 4. It is compared it to a natural competition, the A Priori algorithm. The main contribution of this part of the dissertation is the unsupervised I score, but we also suggest several ways to leverage the variable sets the I score locates in order to mine for association rules.
In Part II, supervised data is confronted. Though the I Score has been used in reference to these types of data in the past, several interesting ways of leveraging it (and the modules of covariates it identifies) are investigated. Though much of this methodology adopts procedures which are individually well-established in literature, the contribution of this dissertation is organization and implementation of these methods in the context of the I Score. Several module-based regression and voting methods are introduced in chapter 7, including a new LASSO-based method for optimizing voting weights. These methods can be considered intuitive and readily applicable to a huge number of datasets of sometimes colossal size. In particular, in chapter 8, a large dataset on Hepatitis and another on Oral Cancer are analyzed. The results for some of the methods are quite promising and competitive with existing methods, especially with regard to prediction. A flexible and multifaceted procedure is suggested in order to provide a thorough arsenal when dealing with the problem of prediction in these complex data sets.
Ultimately, we highlight some benefits and future directions of the method.Statistics, Biostatisticsmra2110StatisticsDissertationsEfficiency in Lung Transplant Allocation Strategies
http://academiccommons.columbia.edu/catalog/ac:187899
Zou, Jingjinghttp://dx.doi.org/10.7916/D8QV3KKZTue, 12 May 2015 18:28:18 +0000Currently in the United States, lungs are allocated to transplant candidates based on the Lung Allocation Score (LAS). The LAS is an empirically derived score aimed at increasing total life span pre- and post-transplantation, for patients on lung transplant waiting lists. The goal here is to develop efficient allocation strategies in the context of lung transplantation.
In this study, patient and organ arrivals to the waiting list are modeled as independent homogeneous Poisson processes. Patients' health status prior to allocations are modeled as evolving according to independent and identically distributed finite-state inhomogeneous Markov processes, in which death is treated as an absorbing state. The expected post-transplantation residual life is modeled as depending on time on the waiting list and on current health status. For allocation strategies satisfying certain minimal fairness requirements, the long-term limit of expected average total life exists, and is used as the standard for comparing allocation strategies.
Via the Hamilton-Jacobi-Bellman equations, upper bounds as a function of the ratio of organ arrival rate to the patient arrival rate for the long-term expected average total life are derived, and corresponding to each upper bound is an allocable set of (state, time) pairs at which patients would be optimally transplanted. As availability of organs increases, the allocable set expands monotonically, and ranking members of the waiting list according to the availability at which they enter the allocable set provides an allocation strategy that leads to long-term expected average total life close to the upper bound.
Simulation studies are conducted with model parameters estimated from national lung transplantation data from United Network for Organ Sharing (UNOS). Results suggest that compared to the LAS, the proposed allocation strategy could provide a 7% increase in average total life.Statisticsjz2335StatisticsDissertationsGLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks
http://academiccommons.columbia.edu/catalog/ac:185410
He, Ran; Zheng, Tianhttp://dx.doi.org/10.7916/D8S46QVQThu, 02 Apr 2015 00:00:00 +0000Large network, as a form of big data, has received increasing amount of attention in data science, especially for large social network, which is reaching the size of hundreds of millions, with daily interactions on the scale of billions. Thus analyzing and modeling these data to understand the connectivities and dynamics of large networks is important in a wide range of scientific fields. Among popular models, exponential random graph models (ERGMs) have been developed to study these complex networks by directly modeling network structures and features. ERGMs, however, are hard to scale to large networks because maximum likelihood estimation of parameters in these models can be very difficult, due to the unknown normalizing constant. Alternative strategies based on Markov chain Monte Carlo (MCMC) draw samples to approximate the likelihood, which is then maximized to obtain the maximum likelihood estimators (MLE). These strategies have poor convergence due to model degeneracy issues and cannot be used on large networks. Chatterjee et al. (Ann Stat 41:2428–2461, 2013) propose a new theoretical framework for estimating the parameters of ERGMs by approximating the normalizing constant using the emerging tools in graph theory—graph limits. In this paper, we construct a complete computational procedure built upon their results with practical innovations which is fast and is able to scale to large networks. More specifically, we evaluate the likelihood via simple function approximation of the corresponding ERGM’s graph limit and iteratively maximize the likelihood to obtain the MLE. We also discuss the methods of conducting likelihood ratio test for ERGMs as well as related issues. Through simulation studies and real data analysis of two large social networks, we show that our new method outperforms the MCMC-based method, especially when the network size is large (more than 100 nodes). One limitation of our approach, inherited from the limitation of the result of Chatterjee et al. (Ann Stat 41:2428–2461, 2013), is that it works only for sequences of graphs with a positive limiting density, i.e., dense graphs.Statisticsrh2528, tz33StatisticsArticlesHow many people do you know?: Efficiently estimating personal network size
http://academiccommons.columbia.edu/catalog/ac:185367
Zheng, Tian; Salganik, Matthew J.; McCormick, Tyler H.http://dx.doi.org/10.7916/D8FX78BTTue, 31 Mar 2015 00:00:00 +0000In this paper we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. and other previous attempts to estimate individual network size, we propose a latent non-random mixing model which resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names to be asked about are chosen properly, the simple scale-up degree estimates can enjoy the same bias-reduction as that from the our more complex latent non-random mixing model.Statistics, Social researchtz33StatisticsArticlesSurveying Hard-to-Reach Groups Through Sampled Respondents in a Social Network
http://academiccommons.columbia.edu/catalog/ac:185373
McCormick, Tyler H.; Zheng, Tian; He, Ran; Kolaczyk, Erichttp://dx.doi.org/10.7916/D8Z0372NTue, 31 Mar 2015 00:00:00 +0000The sampling frame in most social science surveys misses members of certain groups, such as the homeless or individuals living with HIV. These groups are known as hard-to-reach groups. One strategy for learning about these groups, or subpopulations, involves reaching hard-to-reach group members through their social network. In this paper we compare the efficiency of two common methods for subpopulation size estimation using data from standard surveys. These designs are examples of mental link tracing designs. These designs begin with a randomly sampled set of network members (nodes) and then reach other nodes indirectly through questions asked to the sampled nodes. Mental link tracing designs cost significantly less than traditional link tracing designs, yet introduce additional sources of potential bias. We examine the influence of one such source of bias using simulation studies. We then demonstrate our findings using data from the General Social Survey collected in 2004 and 2006. Additionally, we provide survey design suggestions for future surveys incorporating such designs.Statistics, Social researchtz33, rh2528StatisticsArticlesA Practical Guide to Measuring Social Structure Using Indirectly Observed Network Data
http://academiccommons.columbia.edu/catalog/ac:185370
McCormick, Tyler H.; Moussa, Amal; DiPrete, Thomas A.; Ruf, Johannes; Gelman, Andrew E.; Teitler, Julien O.; Zheng, Tianhttp://dx.doi.org/10.7916/D86H4G9DTue, 31 Mar 2015 00:00:00 +0000Aggregated relational data (ARD) are an increasingly common tool for learning about social networks through standard surveys. Recent statistical advances present social scientists with new options for analyzing such data. In this article, we propose guidelines for learning about various network processes using ARD and a template to aid practitioners. We first propose that ARD can be used to measure “social distance” between a respondent and a subpopulation (individuals named Kevin, those in prison, or those serving in the military). We then present common methods for analyzing these data and associate each of these methods with a specific way of measuring social distance, thus associating statistical tools with their underlying social science phenomena. We examine the implications of using each of these social distance measures using an Internet survey about contemporary political issues.Statistics, Social researchtad61, ag389, jot8, tz33Sociology, Statistics, Social WorkArticlesHow Many People Do You Know in Prison? Using Overdispersion in Count Data to Estimate Social Structure in Networks
http://academiccommons.columbia.edu/catalog/ac:185364
Zheng, Tian; Salganik, Matthew J.; Gelman, Andrew E.http://dx.doi.org/10.7916/D800011WMon, 30 Mar 2015 00:00:00 +0000Networks—sets of objects connected by relationships—are important in a number of fields. The study of networks has long been central to sociology, where researchers have attempted to understand the causes and consequences of the structure of relationships in large groups of people. Using insight from previous network research, Killworth et al. and McCarty et al. have developed and evaluated a method for estimating the sizes of hard-to-count populations using network data collected from a simple random sample of Americans. In this article we show how, using a multilevel overdispersed Poisson regression model, these data also can be used to estimate aspects of social structure in the population. Our work goes beyond most previous research on networks by using variation, as well as average responses, as a source of information. We apply our method to the data of McCarty et al. and find that Americans vary greatly in their number of acquaintances. Further, Americans show great variation in propensity to form ties to people in some groups (e.g., males in prison, the homeless, and American Indians), but little variation for other groups (e.g., twins, people named Michael or Nicole). We also explore other features of these data and consider ways in which survey data can be used to estimate network structure.Statistics, Social researchtz33, ag389Political Science, StatisticsArticlesBackward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs
http://academiccommons.columbia.edu/catalog/ac:185325
Zheng, Tian; Wang, Hui; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8SF2V33Mon, 30 Mar 2015 00:00:00 +0000Background: The studies of complex traits project new challenges to current methods that evaluate association between genotypes and a specific trait. Consideration of possible interactions among loci leads to overwhelming dimensions that cannot be handled using current statistical methods. Methods: In this article, we evaluate a multi-marker screening algorithm--the backward genotype-trait association (BGTA) algorithm for case-control designs, which uses unphased multi-locus genotypes. BGTA carries out a global investigation on a candidate marker set and automatically screens out markers carrying diminutive amounts of information regarding the trait in question. To address the "too many possible genotypes, too few informative chromosomes" dilemma of a genomic-scale study that consists of hundreds to thousands of markers, we further investigate a BGTA-based marker selection procedure, in which the screening algorithm is repeated on a large number of random marker subsets. Results of these screenings are then aggregated into counts that the markers are retained by the BGTA algorithm. Markers with exceptional high counts of returns are selected for further analysis. Results and Conclusion: Evaluated using simulations under several disease models, the proposed methods prove to be more powerful in dealing with epistatic traits.We also demonstrate the proposed methods through an application to a study on the inflammatory bowel disease.Statistics, Genetics, Biostatisticstz33, hw2334, shl5Microbiology and Immunology, Statistics, BiostatisticsArticlesComment: Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies
http://academiccommons.columbia.edu/catalog/ac:184983
Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D84T6H8MSat, 28 Mar 2015 00:00:00 +0000The authors suggest an interesting way to measure the fraction of missing information in the context of hypothesis testing. The measure seeks to quantify the impact of missing observations on the test between two hypotheses. The amount of impact can be useful information for applied research. An example is, in genetics, where multiple tests of the same sort are performed on different variables with different missing rates, and follow-up studies may be designed to resolve missing values in selected variables. In this discussion, we offer our prospective views on the use of relative information in a follow-up study. For studies where the impact of missing observations varies greatly across different variables and where the investigators have the flexibility of designing studies that can have different efforts on variables, an optimal design may be derived using relative information measures to improve the cost-effectiveness of the followup.Statisticstz33, shl5StatisticsArticlesOn Bootstrap Tests of Symmetry About an Unknown Median
http://academiccommons.columbia.edu/catalog/ac:184965
Zheng, Tian; Gastwirth, Joseph L.http://dx.doi.org/10.7916/D8X9296PFri, 27 Mar 2015 00:00:00 +0000It is important to examine the symmetry of an underlying distribution before applying some statistical procedures to a data set. For example, in the Zuni School District case, a formula originally developed by the Department of Education trimmed 5% of the data symmetrically from each end. The validity of this procedure was questioned at the hearing by Chief Justice Roberts. Most tests of symmetry (even nonparametric ones) are not distribution free in finite sample sizes. Hence, using asymptotic distribution may not yield an accurate type I error rate or/and loss of power in small samples. Bootstrap resampling from a symmetric empirical distribution function fitted to the data is proposed to improve the accuracy of the calculated p-value of several tests of symmetry. The results show that the bootstrap method is superior to previously used approaches relying on the asymptotic distribution of the tests that assumed the data come from a normal distribution. Incorporating the bootstrap estimate in a recently proposed test due to Miao, Gel and Gastwirth (2006) preserved its level and shows it has reasonable power properties on the family of distribution evaluated.Statisticstz33StatisticsArticlesDiscovering influential variables: A method of partitions
http://academiccommons.columbia.edu/catalog/ac:184953
Chernoff, Herman; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8PR7TVMFri, 27 Mar 2015 00:00:00 +0000A trend in all scientific disciplines, based on advances in technology, is the increasing availability of high dimensional data in which are buried important information. A current urgent challenge to statisticians is to develop effective methods of finding the useful information from the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer intensive approach, based on a method pioneered by Lo and Zheng for detecting which, of many potential explanatory variables, have an influence on a dependent variable Y. This approach is suited to detect influential variables, where causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis, involving possibly thousands of variables, by dealing with many randomly selected small subsets from which smaller subsets are selected, guided by a measure of influence I. The main objective is to discover the influential variables, rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be vulnerable to appropriate analysis. In a sense, we are confining our attention to locating a few needles in a haystack.Statistics, Computer scienceshl5, tz33StatisticsArticlesLatent demographic profile estimation in hard-to-reach groups
http://academiccommons.columbia.edu/catalog/ac:184956
McCormick, Tyler H.; Zheng, Tianhttp://dx.doi.org/10.7916/D8F76BFQFri, 27 Mar 2015 00:00:00 +0000The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using Aggregated relational data (ARD), or questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28–39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.Statisticstz33StatisticsArticlesOn Identifying Rare Variants for Complex Human Traits
http://academiccommons.columbia.edu/catalog/ac:197118
Fan, Ruixuehttp://dx.doi.org/10.7916/D8N29VT4Mon, 16 Mar 2015 12:24:34 +0000This thesis focuses on developing novel statistical tests for rare variants association analysis incorporating both marginal effects and interaction effects among rare variants. Compared with common variants, rare variants have lower minor allele frequencies (typically less than 5%), and hence traditional association tests for common variants will lose power for rare variants. Therefore, there is a pressing need of new analytical tools to tackle the problem of rare variants association with complex human traits. Several collapsing methods have been proposed that aggregate information of rare variants in a region and test them together. They can be divided into burden tests and non-burden tests based on their aggregation strategies. They are all variations of regression-based methods with the assumption that the phenotype is associated with the genotype via a (linear) regression model. Most of these methods consider only marginal effects of rare variants and fail to take into account gene-gene and gene-environmental interactive effects, which are ubiquitous and are of utmost importance in biological systems. In this thesis, we propose a summation of partition approach (SPA) -- a nonparametric strategy for rare variants association analysis. Extensive simulation studies show that SPA is powerful in detecting not only marginal effects but also gene-gene interaction effects of rare variants. Moreover, extensions of SPA are able to detect gene-environment interactions and other interactions existing in complicated biological system as well. We are also able to obtain the asymptotic behavior of the marginal SPA score, which guarantees the power of the proposed method. Inspired by the idea of stepwise variable selection, a significance-based backward dropping algorithm(SDA) is proposed to locate truly influential rare variants in a genetic region that has been identified significant. Unlike traditional backward dropping approaches which remove the least significant variables first, SDA introduces the idea of eliminating the most significant variable at each round. The removed variables are collected and their effects are evaluated by an influence ratio score -- the relative p-value change. Our simulation studies show that SDA is powerful to detect causal variables and SDA has lower false discovery rate than LASSO. We also demonstrate our method using the dataset provided by Genetic Analysis Workshop (GAW) 17 and the results support the superiority of SDA over LASSO. The general partition-retention framework can also be applied to detect gene-environmental interaction effects for common variants. We demonstrate this method using the dataset from Genetic Analysis Workshop (GAW) 18. Our nonparametric approach is able to identify a lot more possible influential gene-environmental pairs than traditional linear regression models. We propose in this thesis a "SPA-SDA" two step approach for rare variants association analysis at genomic scale: first identify significant regions of moderate sizes using SPA, and then apply SDA to the identified regions to pinpoint truly influential variables. This approach is computationally efficient for genomic data and it has the capacity to detect gene-gene and gene-environmental interactions.Statistics, Bioinformaticsrf2283StatisticsDissertationsSPAr package for Fan and Lo (2013) "A robust model-free approach for rare variants association studies incorporating gene-gene and gene-environmental interactions."
http://academiccommons.columbia.edu/catalog/ac:179424
Fan, Ruixue; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D84Q7SN6Fri, 07 Nov 2014 00:00:00 +0000Recently more and more evidence suggest that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Secondly, it does not sacrifice the marginal detection power; in the situation when rare variants only have marginal effects it is comparable with the most competitive method in current literature. Thirdly, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can tailor the procedure to fit their own study friendly. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions. This package is also maintained on the Comprehensive R Archive Network (http://cran.r-project.org). It contains the R programs, user's manual and example codes.Genetics, Statisticsrf2283, shl5StatisticsComputer softwareSource codes for GLMLE algorithm
http://academiccommons.columbia.edu/catalog/ac:178966
Zheng, Tian; He, Ranhttp://dx.doi.org/10.7916/D8HH6HQRFri, 24 Oct 2014 00:00:00 +0000These are the R source codes for the algorithm proposed for fitting exponential random graph models (ERGMs) on large social networks in our paper "Estimation of exponential random graph models for large social networks via graph limits". Specifically, the ERGM model we implement is the one that consider homomorphism densities of edges, two-stars and triangles, the one we examine in the above paper.Statistics, Computer sciencetz33, rh2528StatisticsComputer softwareNew insights into old methods for identifying causal rare variants
http://academiccommons.columbia.edu/catalog/ac:195277
Hu, Inchi; Zheng, Tian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Wang, Haitianhttp://dx.doi.org/10.7916/D8J38R1MTue, 09 Sep 2014 16:21:21 +0000The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.Biostatistics, Statisticstz33, shl5StatisticsArticlesBAMarray™: Java software for Bayesian analysis of variance for microarray data
http://academiccommons.columbia.edu/catalog/ac:192099
Ishwaran, Hemant; Rao, J. Sunil; Kogalur, Udaya B.http://dx.doi.org/10.7916/D8BR8QNZTue, 09 Sep 2014 00:33:07 +0000Background: DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important example is multigroup data collected over different experimental groups, such as data collected from distinct stages of a disease process. We have developed a method specifically addressing these issues termed Bayesian ANOVA for microarrays (BAM). The BAM approach uses a special inferential regularization known as spike-and-slab shrinkage that provides an optimal balance between total false detections and total false non-detections. This translates into more reproducible differential calls. Spike and slab shrinkage is a form of regularization achieved by using information across all genes and groups simultaneously.
Results: BAMarray™ is a graphically oriented Java-based software package that implements the BAM method for detecting differentially expressing genes in multigroup microarray experiments (up to 256 experimental groups can be analyzed). Drop-down menus allow the user to easily select between different models and to choose various run options. BAMarray™ can also be operated in a fully automated mode with preselected run options. Tuning parameters have been preset at theoretically optimal values freeing the user from such specifications. BAMarray™ provides estimates for gene differential effects and automatically estimates data adaptive, optimal cutoff values for classifying genes into biological patterns of differential activity across experimental groups. A graphical suite is a core feature of the product and includes diagnostic plots for assessing model assumptions and interactive plots that enable tracking of prespecified gene lists to study such things as biological pathway perturbations. The user can zoom in and lasso genes of interest that can then be saved for downstream analyses.
Conclusion: BAMarray™ is user friendly platform independent software that effectively and efficiently implements the BAM methodology. Classifying patterns of differential activity is greatly facilitated by a data adaptive cutoff rule and a graphical suite. BAMarray™ is licensed software freely available to academic institutions. More information can be found at http://www.bamarray.com.Statistics, Information technology, Bioinformaticsubk2101StatisticsArticlesLimit Theory for Spatial Processes, Bootstrap Quantile Variance Estimators, and Efficiency Measures for Markov Chain Monte Carlo
http://academiccommons.columbia.edu/catalog/ac:188852
Yang, Xuanhttp://dx.doi.org/10.7916/D84X560ZThu, 07 Aug 2014 12:12:31 +0000This thesis contains three topics: (I) limit theory for spatial processes, (II) asymptotic results on the bootstrap quantile variance estimator for importance sampling, and (III) an efficiency measure of MCMC.
(I) First, central limit theorems are obtained for sums of observations from a $\kappa$-weakly dependent random field. In particular, it is considered that the observations are made from a random field at irregularly spaced and possibly random locations. The sums of these samples as well as sums of functions of pairs of the observations are objects of interest; the latter has applications in covariance estimation, composite likelihood estimation, etc. Moreover, examples of $\kappa$-weakly dependent random fields are explored and a method for the evaluation of $\kappa$-coefficients is presented.
Next, statistical inference is considered for the stochastic heteroscedastic processes (SHP) which generalize the stochastic volatility time series model to space. A composite likelihood approach is adopted for parameter estimation, where the composite likelihood function is formed by a weighted sum of pairwise log-likelihood functions. In addition, the observations sites are assumed to distributed according to a spatial point process. Sufficient conditions are provided for the maximum composite likelihood estimator to be consistent and asymptotically normal.
(II) It is often difficult to provide an accurate estimation for the variance of the weighted sample quantile. Its asymptotic approximation requires the value of the density function which may be hard to evaluate in complex systems. To circumvent this problem, the bootstrap estimator is considered. Theoretical results are established for the exact convergence rate and asymptotic distributions of the bootstrap variance estimators for quantiles of weighted empirical distributions. Under regularity conditions, it is shown that the bootstrap variance estimator is asymptotically normal and has relative standard deviation of order O(n^-1/4)
(III) A new performance measure is proposed to evaluate the efficiency of Markov chain Monte Carlo (MCMC) algorithms. More precisely, the large deviations rate of the probability that the Monte Carlo estimator deviates from the true by a certain distance is used as a measure of efficiency of a particular MCMC algorithm. Numerical methods are proposed for the computation of the rate function based on samples of the renewal cycles of the Markov chain. Furthermore the efficiency measure is applied to an array of MCMC schemes to determine their optimal tuning parameters.Statisticsxy2139StatisticsDissertationsUnbiased Penetrance Estimates with Unknown Ascertainment Strategies
http://academiccommons.columbia.edu/catalog/ac:175879
Gore, Kristenhttp://dx.doi.org/10.7916/D8KP8098Mon, 07 Jul 2014 00:00:00 +0000Allelic variation in the genome leads to variation in individuals' production of proteins. This, in turn, leads to variation in traits and development, and, in some cases, to diseases. Understanding the genetic basis for disease can aid in the search for therapies and in guiding genetic counseling. Thus, it is of interest to discover the genes with mutations responsible for diseases and to understand the impact of allelic variation at those genes. A subject's genetic composition is commonly referred to as the subject's genotype. Subjects who carry the gene mutation of interests are referred to as carriers. Subjects who are afflicted with a disease under study (that is, subjects who exhibit the phenotype) are termed affected carriers. The age-specific probability that a given subject will exhibit a phenotype of interest, given mutation status at a gene is known as penetrance. Understanding penetrance is an important facet of genetic epidemiology. Penetrance estimates are typically calculated via maximum likelihood from family data. However, penetrance estimates can be biased if the nature of the sampling strategy is not correctly reflected in the likelihood. Unfortunately, sampling of family data may be conducted in a haphazard fashion or, even if conducted systematically, might be reported in an incomplete fashion. Bias is possible in applying likelihood methods to reported data if (as is commonly the case) some unaffected family members are not represented in the reports. The purpose here is to present an approach to find efficient and unbiased penetrance estimates in cases where there is incomplete knowledge of the sampling strategy and incomplete information on the full pedigree structure of families included in the data. The method may be applied with different conjectural assumptions about the ascertainment strategy to balance the possibly biasing effects of wishful assumptions about the sampling strategy with the efficiency gains that could be obtained through valid assumptions.StatisticsStatisticsDissertationsStatistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
http://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephaniehttp://dx.doi.org/10.7916/D8TQ5ZP5Mon, 07 Jul 2014 00:00:00 +0000There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis exams several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models. We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We begin by examining the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings. Depending on the area of application and the researcher's degree of control over the experiment design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models. In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.Statistics, Educational tests and measurements, Quantitative psychology and psychometricsStatisticsDissertationsLearning Theory Analysis for Association Rules and Sequential Event Prediction
http://academiccommons.columbia.edu/catalog/ac:173905
Rudin, Cynthia; Letham, Benjamin; Madigan, David B.http://dx.doi.org/10.7916/D82N50C1Thu, 15 May 2014 00:00:00 +0000We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction." In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer's online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start" problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence" measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.Statistics, Artificial intelligencedm2418StatisticsArticlesAnalysis of Variance of Cross-Validation Estimators of the Generalization Error
http://academiccommons.columbia.edu/catalog/ac:173902
Markatou, Marianthi; Tian, Hong; Biswas, Shameek; Hripcsak, George M.http://dx.doi.org/10.7916/D86D5R2XThu, 15 May 2014 00:00:00 +0000This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables Y=Card(Sj ∩ Sj') and Y*=Card(Sjc ∩ Sj'c), where Sj, Sj' are two training sets, and Sjc, Sj'c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results in the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.Statistics, Artificial intelligencemm168, ht2031, spb2003, gh13Statistics, Biomedical Informatics, BiostatisticsArticlesAlgorithms for Sparse Linear Classifiers in the Massive Data Setting
http://academiccommons.columbia.edu/catalog/ac:173908
Balakrishnan, Suhrid; Bartlett, Peter; Madigan, David B.http://dx.doi.org/10.7916/D8Z0368XThu, 15 May 2014 00:00:00 +0000Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.Statistics, Artificial intelligencedm2418StatisticsArticlesBook Reviews: Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth.
http://academiccommons.columbia.edu/catalog/ac:173915
Madigan, David B.http://dx.doi.org/10.7916/D8DZ06D8Thu, 15 May 2014 00:00:00 +0000"Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth. MIT Press, Cambridge, MA, 2001. $50.00. xxxii+546 pp., hardcover. ISBN 0-262-08290-X. Is data mining the same as statistics? The distinguished authors of Principles of Data Mining struggle to make a distinction between the two subjects. In the end, what they have written is a fine applied statistics text." -- page 501Statisticsdm2418StatisticsReviewsMedication-Wide Association Studies
http://academiccommons.columbia.edu/catalog/ac:173912
Ryan, P. B.; Stang, P. E.; Madigan, David B.; Schuemie, M. J.; Hripcsak, George M.http://dx.doi.org/10.7916/D8PG1PVXThu, 15 May 2014 00:00:00 +0000Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges in confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects.Pharmacology, Statistics, Bioinformaticsdm2418, gh13Statistics, Biomedical InformaticsArticlesA One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
http://academiccommons.columbia.edu/catalog/ac:173899
Balakrishnan, Suhrid; Madigan, David B.http://dx.doi.org/10.7916/D8B56GTPThu, 15 May 2014 00:00:00 +0000For Bayesian analysis of massive data, Markov chain Monte Carlo (MCMC) techniques often prove infeasible due to computational resource constraints. Standard MCMC methods generally require a complete scan of the dataset for each iteration. Ridgeway and Madigan (2002) and Chopin (2002b) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a small portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this paper, we present "1PFS," an efficient, one-pass algorithm. The algorithm employs a simple modification of the Ridgeway and Madigan (2002) particle filtering algorithm that replaces the MCMC based "rejuvenation" step with a more efficient "shrinkage" kernel smoothing based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway and Madigan (2002), namely a mixture model for Markov chains and Bayesian logistic regression. Our results indicate the proposed scheme delivers accurate parameter estimates while employing only a single pass through the data.Mathematics, Statisticsdm2418StatisticsArticlesA Characterization of Markov Equivalence Classes for Acyclic Digraphs
http://academiccommons.columbia.edu/catalog/ac:173896
Andersson, Steen A.; Madigan, David B.; Perlman, Michael D.http://dx.doi.org/10.7916/D8FX77J3Thu, 15 May 2014 00:00:00 +0000Undirected graphs and acyclic digraphs (ADG's), as well as their mutual extension to chain graphs, are widely used to describe dependencies among variables in multiviarate distributions. In particular, the likelihood functions of ADG models admit convenient recursive factorizations that often allow explicit maximum likelihood estimates and that are well suited to building Bayesian networks for expert systems. Whereas the undirected graph associated with a dependence model is uniquely determined, there may be many ADG's that determine the same dependence (i.e., Markov) model. Thus, the family of all ADG's with a given set of vertices is naturally partitioned into Markov-equivalence classes, each class being associated with a unique statistical model. Statistical procedures, such as model selection of model averaging, that fail to take into account these equivalence classes may incur substantial computational or other inefficiences. Here it is show that each Markov-equivalence class is uniquely determined by a single chain graph, the essential graph, that is itself simultaneously Markov equivalent to all ADG's in the equivalence class. Essential graphs are characterized, a polynomial-time algorithm for their construction is given, and their applications to model selection and other statistical questions are described.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticlesBayesian Hierarchical Rule Modeling for Predicting Medical Conditions
http://academiccommons.columbia.edu/catalog/ac:173882
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D8V69GP1Wed, 14 May 2014 00:00:00 +0000We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future medical conditions given the patient’s current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “condition 1 and condition 2 → condition 3”) from a large set of candidate rules. Because this method “borrows strength” using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of conditions is available.Applied mathematics, Statistics, Medicinedm2418StatisticsArticlesCorrection: Separation and completeness properties for AMP chain graph Markov models
http://academiccommons.columbia.edu/catalog/ac:173887
Madigan, David B.; Levitz, Michael; Perlman, Michael D.http://dx.doi.org/10.7916/D8QF8R05Wed, 14 May 2014 00:00:00 +0000Correction of table 2 on page 1757 of 'Separation and completeness properties for AMP chain graph Markov models', Annals of Statistics, volume 29 (2001).Mathematics, Statisticsdm2418StatisticsArticlesGenerating Productive Dialogue between Consulting Statisticians and their Clients in the Pharmaceutical and Medical Research Settings
http://academiccommons.columbia.edu/catalog/ac:173832
Emir, Birol; Amaratunga, Dhammika; Beltangady, Mohan; Cabrera, Javier; Freeman, Roy; Madigan, David B.; Nguyen, Ha H.; Whalen, Edward Patrickhttp://dx.doi.org/10.7916/D8PK0D8NTue, 13 May 2014 00:00:00 +0000Due to the ever-increasing complexity of scientific technologies and resulting data, consulting statisticians are becoming more involved in the design, conduct, and analysis of biomedical research. This requires extensive collaboration between the consulting statistician and nonstatisticians, such as researchers, clinicians, and corporate executives. Consequently, a successful consulting career is becoming ever more dependent on the statistician's ability to effectively communicate with nonstatisticians. This is especially true when more complex, nontraditional analytical methods are required. In this paper, we examine the collaboration between statisticians and nonstatisticians from three different professional perspectives. Integrating these perspectives, we discuss ways to help the consulting statistician generate productive dialogue with clients. Finally, we examine how universities can better prepare students for careers in statistical consulting by incorporating more communication-based elements into their curriculum and by offering students ample opportunities to collaborate with nonstatisticians. Overall, we designed this exercise to help the consulting statistician generate dialogue with clients that results in more productive collaborations and a more satisfying work experience.Statistics, Bioinformatics, Medicinebe2166, dm2418, hhn2108, ew2320StatisticsArticlesA Note on Equivalence Classes of Directed Acyclic Independence Graphs
http://academiccommons.columbia.edu/catalog/ac:173826
Madigan, David B.http://dx.doi.org/10.7916/D8TB150CTue, 13 May 2014 00:00:00 +0000Directed acyclic independence graphs (DAIGs) play an important role in recent developments in probabilistic expert systems and influence diagrams (Chyu [1]). The purpose of this note is to show that DAIGs can usefully be grouped into equivalence classes where the members of a single class share identical Markov properties. These equivalence classes can be identified via a simple graphical criterion. This result is particularly relevant to model selection procedures for DAIGs (see, e.g., Cooper and Herskovits [2] and Madigan and Raftery [4]) because it reduces the problem of searching among possible orientations of a given graph to that of searching among the equivalence classes.Mathematics, Statisticsdm2418StatisticsArticlesA Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization
http://academiccommons.columbia.edu/catalog/ac:173817
Eyheramendy, Susana; Madigan, David B.http://dx.doi.org/10.7916/D86M34ZFTue, 13 May 2014 00:00:00 +0000We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method in text classification and in simulated data and show that our method outperforms the logistic and probit models and also the elastic net, in general by a substantial margin.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsBook chaptersLocation Estimation in Wireless Networks: A Bayesian Approach
http://academiccommons.columbia.edu/catalog/ac:173820
Madigan, David B.; Ju, Wen-Hua; Krishnan, P.; Krishnakumar, A. S. ; Zorych, Ivanhttp://dx.doi.org/10.7916/D82V2D74Tue, 13 May 2014 00:00:00 +0000We present a Bayesian hierarchical model for indoor location estimation in wireless networks. We demonstrate that out model achieves accuracy that is similar to other published models and algorithms. By harnessing prior knowledge, our model drastically reduces the requirement for training data as compared with existing approaches.Mathematics, Statistics, Applied mathematicsdm2418StatisticsArticles[Least Angle Regression]: Discussion
http://academiccommons.columbia.edu/catalog/ac:173841
Madigan, David B.; Ridgeway, Greghttp://dx.doi.org/10.7916/D81V5C29Tue, 13 May 2014 00:00:00 +0000Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms [e.g., Osborne, Presnell and Turlach (2000)]. Efron, Hastie, Johnstone and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such this paper is an important contribution to statistical computing.Mathematics, Statisticsdm2418StatisticsArticlesA Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction
http://academiccommons.columbia.edu/catalog/ac:173838
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D89C6VJDTue, 13 May 2014 00:00:00 +0000In many healthcare settings, patients visit healthcare professionals periodically and report multiple medical conditions, or symptoms, at each encounter. We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future symptoms given the patient’s current and past history of reported symptoms. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “symptom 1 and symptom 2 → symptom 3 ”) from a large set of candidate rules. Because this method “borrows strength” using the symptoms of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of symptoms is available.Mathematics, Statistics, Medicinedm2418StatisticsArticles[Bayesian Analysis in Expert Systems]: Comment: What's Next?
http://academiccommons.columbia.edu/catalog/ac:173856
Madigan, David B.http://dx.doi.org/10.7916/D8W37TFJTue, 13 May 2014 00:00:00 +0000"These papers represent two of the many different graphical modeling camps that have emerged from a flurry of activity in the past decade. The paper by Cox and Wermuth falls within the statistical graphical modeling camp and provides a useful generalization of that body of work. There is, of course, a price to be paid for this generality, namely that the interpretation of the graphs is more complex...The paper by Spiegelhalter, Dawid, Lauritzen and Cowell falls within the probabilistic expert system camp. This is a tour de force by researchers responsible for much of the astonishing progress in this area. Ten years ago, probabilistic models were shunned by the artificial intelligence community. That they are now widely accepted and used is due in large measure to the insights and efforts of these authors, along with other pioneers such as Judea Pearl and Peter Cheeseman..." -- page 261Mathematics, Statisticsdm2418StatisticsArticlesSeparation and Completeness Properties for Amp Chain Graph Markov Models
http://academiccommons.columbia.edu/catalog/ac:173847
Levitz, Michael; Perlman, Michael D.; Madigan, David B.http://dx.doi.org/10.7916/D8X34VJGTue, 13 May 2014 00:00:00 +0000Pearl’s well-known d-separation criterion for an acyclic directed graph (ADG) is a pathwise separation criterion that can be used to efficiently identify all valid conditional independence relations in the Markov model determined by the graph. This paper introduces p-separation, a pathwise separation criterion that efficiently identifies all valid conditional independences under the Andersson–Madigan–Perlman (AMP) alternative Markov property for chain graphs (= adicyclic graphs), which include both ADGs and undirected graphs as special cases. The equivalence of p-separation to the augmentation criterion occurring in the AMP global Markov property is established, and p-separation is applied to prove completeness of the global Markov property for AMP chain graph models. Strong completeness of the AMP Markov property is established, that is, the existence of Markov perfect distributions that satisfy those and only those conditional independences implied by the AMP property(equivalently, by p-separation). A linear-time algorithm for determining p-separation is presented.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticlesBayesian Model Averaging: a Tutorial (with Comments by M. Clyde, David Draper and E. I. George, and a Rejoinder by the Authors)
http://academiccommons.columbia.edu/catalog/ac:173853
Hoeting, Jennifer A.; Madigan, David B.; Raftery, Adrian E.; Volinsky, Chris T.; Clyde, M.; Draper, David; George, E. I.http://dx.doi.org/10.7916/D84M92N7Tue, 13 May 2014 00:00:00 +0000Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA)provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples.In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.Statisticsdm2418StatisticsArticlesFit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to- one matched case-control data
http://academiccommons.columbia.edu/catalog/ac:174087
Zhou, Hui; Wang, Shuang; Zheng, Tianhttp://dx.doi.org/10.7916/D8028PNJMon, 12 May 2014 00:00:00 +0000Fit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to- one matched case-control dataStatisticshz2240, sw2206, tz33Statistics, BiostatisticsComputer softwareInteraction-Based Learning for High-Dimensional Data with Continuous Predictors
http://academiccommons.columbia.edu/catalog/ac:196647
Huang, Chien-Hsunhttp://dx.doi.org/10.7916/D8X928CHMon, 07 Apr 2014 14:48:11 +0000High-dimensional data, such as that relating to gene expression in microarray experiments, may contain substantial amount of useful information to be explored. However, the information, relevant variables and their joint interactions are usually diluted by noise due to a large number of non-informative variables. Consequently, variable selection plays a pivotal role for learning in high dimensional problems. Most of the traditional feature selection methods, such as Pearson's correlation between response and predictors, stepwise linear regressions and LASSO are among the popular linear methods. These methods are effective in identifying linear marginal effect but are limited in detecting non-linear or higher order interaction effects. It is well known that epistasis (gene - gene interactions) may play an important role in gene expression where unknown functional forms are difficult to identify. In this thesis, we propose a novel nonparametric measure to first screen and do feature selection based on information from nearest neighborhoods. The method is inspired by Lo and Zheng's earlier work (2002) on detecting interactions for discrete predictors. We apply a backward elimination algorithm based on this measure which leads to the identification of many in influential clusters of variables. Those identified groups of variables can capture both marginal and interactive effects. Second, each identified cluster has the potential to perform predictions and classifications more accurately. We also study procedures how to combine these groups of individual classifiers to form a final predictor. Through simulation and real data analysis, the proposed measure is capable of identifying important variable sets and patterns including higher-order interaction sets. The proposed procedure outperforms existing methods in three different microarray datasets. Moreover, the nonparametric measure is quite flexible and can be easily extended and applied to other areas of high-dimensional data and studies.Statisticsch2526StatisticsDissertationsA Point Process Model for the Dynamics of Limit Order Books
http://academiccommons.columbia.edu/catalog/ac:171221
Vinkovskaya, Ekaterinahttp://dx.doi.org/10.7916/D88913WWFri, 28 Feb 2014 00:00:00 +0000This thesis focuses on the statistical modeling of the dynamics of limit order books in electronic equity markets. The statistical properties of events affecting a limit order book -market orders, limit orders and cancellations- reveal strong evidence of clustering in time, cross-correlation across event types and dependence of the order flow on the bid-ask spread. Further investigation reveals the presence of a self-exciting property - that a large number of events in a given time period tends to imply a higher probability of observing a large number of events in the following time period. We show that these properties may be adequately represented by a multivariate self-exciting point process with multiple regimes that reflect changes in the bid-ask spread. We propose a tractable parametrization of the model and perform a Maximum Likelihood Estimation of the model using high-frequency data from the Trades and Quotes database for US stocks. We show that the model may be used to obtain predictions of order flow and that its predictive performance beats the Poisson model as well as Moving Average and Auto Regressive time series models.StatisticsStatisticsDissertationsMixed Methods for Mixed Models
http://academiccommons.columbia.edu/catalog/ac:169644
Dorie, Vincent J.http://dx.doi.org/10.7916/D8V40S5XWed, 22 Jan 2014 00:00:00 +0000This work bridges the frequentist and Bayesian approaches to mixed models by borrowing the best features from both camps: point estimation procedures are combined with priors to obtain accurate, fast inference while posterior simulation techniques are developed that approximate the likelihood with great precision for the purposes of assessing uncertainty. These allow flexible inferences without the need to rely on expensive Markov chain Monte Carlo simulation techniques. Default priors are developed and evaluated in a variety of simulation and real-world settings with the end result that we propose a new set of standard approaches that yield superior performance at little computational cost.StatisticsStatisticsDissertationsKernel-based association measures
http://academiccommons.columbia.edu/catalog/ac:167034
Liu, Yinghttp://hdl.handle.net/10022/AC:P:22154Thu, 07 Nov 2013 00:00:00 +0000Measures of associations have been widely used for describing the statistical relationships between two sets of variables. Traditional association measures tend to focus on specialized settings (specific types of variables or association patterns). Based on an in-depth summary of existing measures, we propose a general framework for association measures unifying existing methods and novel extensions based on kernels, including practical solutions to computational challenges. The proposed framework provides improved feature selection and extensions to a variety of current classifiers. Specifically, we introduce association screening and variable selection via maximizing kernel-based association measures. We also develop a backward dropping procedure for feature selection when there are a large number of candidate variables. We evaluate our framework using a wide variety of both simulated and real data. In particular, we conduct independence tests and feature selection using kernel association measures on diversified association patterns of different dimensions and variable types. The results show the superiority of our methods to existing ones. We also apply our framework to four real-word problems, three from statistical genetics and one of gender prediction from handwriting. We demonstrate through these applications both the de novo construction of new kernels and the adaptation of existing kernels tailored to the data at hand, and how kernel-based measures of associations can be naturally applied to different data structures including functional input and output spaces. This shows that our framework can be applied to a wide range of real world problems and work well in practice.Statistics, Computer scienceyl2802StatisticsDissertationsInference of functional neural connectivity and convergence acceleration methods
http://academiccommons.columbia.edu/catalog/ac:179409
Nikitchenko, Maxim V.http://hdl.handle.net/10022/AC:P:22052Thu, 31 Oct 2013 00:00:00 +0000The knowledge of the maps of neuronal interactions is key for system neuroscience, but at the moment we possess relatively little of it . The recent development of experimental methods which allow a simultaneous recording of the spiking activity, but not the intracellular voltage, of thousands of neurons gives us an opportunity to start filling that gap. In Chapter 2, I present a method for the inference of the parameters of the leaky integrate-and-fire (LIF) model featuring time-dependent currents and conductances based only on the extracellular recording of spiking in the network. The fitted parameters can describe the functional connections in the network, as well as the internal properties of the cells. The method can also be used to determine whether a single-compartment model of a neuron should include conductance- or current-based synapses, or their mixture. In addition, because the same mathematical model describes some of the flavors of the Drift Diffusion Model (DDM), popular in the studies of decision making process, the presented method can be readily used to fit their parameters. Making the proposed inference procedure -- based on the expectation-maximization (EM) algorithm -- accurate and robust, necessitated a development of a new numerical adaptive-grid (AG) method for the forward-backward (FB) propagation of the probability density, which is required in the computation of the sufficient statistic in the EM algorithm. These topics are covered in Chapter 3. Another issue which had to be addressed in order to obtain a usable inference algorithm is the well known slow convergence of the EM algorithm in the flat regions of the loglikelihood. Two complementary approaches to this issue are presented in this dissertation. In Chapter 4, I present a new framework for the acceleration of convergence of iterative algorithms (not limited to the EM) which unifies all previously known methods and allows us to construct a new method demonstrating the best performance of them all. To make the computations even faster, I wrote a Matlab package which allows them to be done in parallel on several machines and clusters. As one can see, all the aforementioned projects were sprouted up from one "head" project on the inference of the LIF model parameters. At the end of the dissertation, I briefly describe a disconnected project which is devoted to the development of a flexible experimental setup (software and hardware) for behavioral experiments, with a specific application to a particular type of the virtual Morris water maze experiment (VMWM).Neurosciences, Statisticsmvn2104Statistics, Neurobiology and BehaviorDissertationsLow-rank graphical models and Bayesian inference in the statistical analysis of noisy neural data
http://academiccommons.columbia.edu/catalog/ac:166472
Smith, Carl Alexanderhttp://hdl.handle.net/10022/AC:P:21991Fri, 11 Oct 2013 00:00:00 +0000We develop new methods of Bayesian inference, largely in the context of analysis of neuroscience data. The work is broken into several parts. In the first part, we introduce a novel class of joint probability distributions in which exact inference is tractable. Previously it has been difficult to find general constructions for models in which efficient exact inference is possible, outside of certain classical cases. We identify a class of such models that are tractable owing to a certain "low-rank" structure in the potentials that couple neighboring variables. In the second part we develop methods to quantify and measure information loss in analysis of neuronal spike train data due to two types of noise, making use of the ideas developed in the first part. Information about neuronal identity or temporal resolution may be lost during spike detection and sorting, or precision of spike times may be corrupted by various effects. We quantify the information lost due to these effects for the relatively simple but sufficiently broad class of Markovian model neurons. We find that decoders that model the probability distribution of spike-neuron assignments significantly outperform decoders that use only the most likely spike assignments. We also apply the ideas of the low-rank models from the first section to defining a class of prior distributions over the space of stimuli (or other covariate) which, by conjugacy, preserve the tractability of inference. In the third part, we treat Bayesian methods for the estimation of sparse signals, with application to the locating of synapses in a dendritic tree. We develop a compartmentalized model of the dendritic tree. Building on previous work that applied and generalized ideas of least angle regression to obtain a fast Bayesian solution to the resulting estimation problem, we describe two other approaches to the same problem, one employing a horseshoe prior and the other using various spike-and-slab priors. In the last part, we revisit the low-rank models of the first section and apply them to the problem of inferring orientation selectivity maps from noisy observations of orientation preference. The relevant low-rank model exploits the self-conjugacy of the von Mises distribution on the circle. Because the orientation map model is loopy, we cannot do exact inference on the low-rank model by the forward backward algorithm, but block-wise Gibbs sampling by the forward backward algorithm speeds mixing. We explore another von Mises coupling potential Gibbs sampler that proves to effectively smooth noisily observed orientation maps.Statistics, Neurosciencescas2207Statistics, ChemistryDissertationsGeneralized Volatility-Stabilized Processes
http://academiccommons.columbia.edu/catalog/ac:165162
Pickova, Radkahttp://hdl.handle.net/10022/AC:P:21616Fri, 13 Sep 2013 00:00:00 +0000In this thesis, we consider systems of interacting diffusion processes which we call Generalized Volatility-Stabilized processes, as they extend the Volatility-Stabilized Market models introduced in Fernholz and Karatzas (2005). First, we show how to construct a weak solution of the underlying system of stochastic differential equations. In particular, we express the solution in terms of time-changed squared-Bessel processes and argue that this solution is unique in distribution. In addition, we also discuss sufficient conditions under which this solution does not explode in finite time, and provide sufficient conditions for pathwise uniqueness and for existence of a strong solution. Secondly, we discuss the significance of these processes in the context of Stochastic Portfolio Theory. We describe specific market models which assume that the dynamics of the stocks' capitalizations is the same as that of the Generalized Volatility-Stabilized processes, and we argue that strong relative arbitrage opportunities may exist in these markets, specifically, we provide multiple examples of portfolios that outperform the market portfolio. Moreover, we examine the properties of market weights as well as the diversity weighted portfolio in these models. Thirdly, we provide some asymptotic results for these processes which allows us to describe different properties of the corresponding market models based on these processes.Statisticsrp2424Statistics, MathematicsDissertationsCredit Risk Modeling and Analysis Using Copula Method and Changepoint Approach to Survival Data
http://academiccommons.columbia.edu/catalog/ac:161682
Qian, Bohttp://hdl.handle.net/10022/AC:P:20510Thu, 30 May 2013 00:00:00 +0000This thesis consists of two parts. The first part uses Gaussian Copula and Student's t Copula as the main tools to model the credit risk in securitizations and re-securitizations. The second part proposes a statistical procedure to identify changepoints in Cox model of survival data. The recent 2007-2009 financial crisis has been regarded as the worst financial crisis since the Great Depression by leading economists. The securitization sector took a lot of blame for the crisis because of the connection of the securitized products created from mortgages to the collapse of the housing market. The first part of this thesis explores the relationship between securitized mortgage products and the 2007-2009 financial crisis using the Copula method as the main tool. We show in this part how loss distributions of securitizations and re-securitizations can be derived or calculated in a new model. Simulations are conducted to examine the effectiveness of the model. As an application, the model is also used to examine whether and where the ratings of securitized products could be flawed. On the other hand, the lag effect and saturation effect problems are common and important problems in survival data analysis. They belong to a general class of problems where the treatment effect takes occasional jumps instead of staying constant throughout time. Therefore, they are essentially the changepoint problems in statistics. The second part of this thesis focuses on extending Lai and Xing's recent work in changepoint modeling, which was developed under a time series and Bayesian setup, to the lag effect problems in survival data. A general changepoint approach for Cox model is developed. Simulations and real data analyses are conducted to illustrate the effectiveness of the procedure and how it should be implemented and interpreted.Statisticsbq2102StatisticsDissertationsEstimation and Testing Methods for Monotone Transformation Models
http://academiccommons.columbia.edu/catalog/ac:188499
Zhang, Junyihttp://dx.doi.org/10.7916/D8348JQDThu, 23 May 2013 00:00:00 +0000This thesis deals with a general class of transformation models that contains many important semiparametric regression models as special cases. It develops a self-induced smoothing method for estimating the regression coefficients of these models, resulting in simultaneous point and variance estimations. The self-induced smoothing does not require bandwidth selection, yet provides the right amount of smoothness so that the estimator is asymptotically normal with mean zero (unbiased) and variance-covariance matrix consistently estimated by the usual sandwich-type estimator. An iterative algorithm is given for the variance estimation and shown to numerically converge to a consistent limiting variance estimator. The self-induced smoothing method is also applied to selecting the non-zero regression coefficients for the monotone transformation models. The resulting regularized estimator is shown to be root-n-consistent and achieve desirable sparsity and asymptotic normality under certain regularity conditions. The smoothing technique is used to estimate the monotone transformation function as well. The smoothed rank-based estimate of the transformation function is uniformly consistent and converges weakly to a Gaussian process which is the same as the limiting process for that without smoothing. An explicit covariance function estimate is obtained by using the smoothing technique, and shown to be consistent. The estimation of the transformation function reduces the multiple hypotheses testing problems for the monotone transformation models to those for linear models. A new hypotheses testing procedure is proposed in this thesis for linear models and shown to be more powerful than some widely-used testing methods when there is a strong collinearity in data. It is proved that the new testing procedure controls the family-wise error rate.Statisticsjz2299StatisticsDissertationsStatistical Inference for Diagnostic Classification Models
http://academiccommons.columbia.edu/catalog/ac:160464
Xu, Gongjunhttp://hdl.handle.net/10022/AC:P:20058Tue, 30 Apr 2013 00:00:00 +0000Diagnostic classification models (DCM) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he/she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers to the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be further extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, this problem can be formulated as a pure significance testing problem with null hypothesis H0 : Q = Q0, where Q0 is the candidate Q-matrix. We propose a test statistic that measures the consistency of observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied. In addition, we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.Statistics, Educational tests and measurementsgx2108StatisticsDissertationsBayesian Model Selection in terms of Kullback-Leibler discrepancy
http://academiccommons.columbia.edu/catalog/ac:158374
Zhou, Shouhaohttp://hdl.handle.net/10022/AC:P:19157Mon, 25 Feb 2013 00:00:00 +0000In this article we investigate and develop the practical model assessment and selection methods for Bayesian models, when we anticipate that a promising approach should be objective enough to accept, easy enough to understand, general enough to apply, simple enough to compute and coherent enough to interpret. We mainly restrict attention to the Kullback-Leibler divergence, a widely applied model evaluation measurement to quantify the similarity between the proposed candidate model and the underlying true model, where the true model is only referred to a probability distribution as the best projection onto the statistical modeling space once we try to understand the real but unknown dynamics/mechanism of interest. In addition to review and discussion on the advantages and disadvantages of the historically and currently prevailing practical model selection methods in literature, a series of convenient and useful tools, each designed and applied for different purposes, are proposed to asymptotically unbiasedly assess how the candidate Bayesian models are favored in terms of predicting a future independent observation. What's more, we also explore the connection of the Kullback-Leibler based information criterion to the Bayes factors, another most popular Bayesian model comparison approaches, after seeing the motivation through the developments of the Bayes factor variants. In general, we expect to provide a useful guidance for researchers who are interested in conducting Bayesian data analysis.Statisticssz2020StatisticsDissertationsMultiplicative Multiresolution Analysis for Lie-group Valued Data Indexed by a Euclidean Parameter
http://academiccommons.columbia.edu/catalog/ac:155756
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15397Wed, 12 Dec 2012 00:00:00 +0000Lie-valued euclidean indexed data. These data might be: phase angles as functions of time or space, for example compass directions; 3D orientations of a rigid frame of reference as a function of time or space; or, quaternions as a function of time or space. This can also be extended to quotients of lie groups which gives us the ability to model points on S2, the unit sphere, as functions of time or space.Computer science, Statisticsvcs2115StatisticsPresentationsBayesian Statistical Pragmatism
http://academiccommons.columbia.edu/catalog/ac:154737
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:15340Tue, 20 Nov 2012 00:00:00 +0000I agree with Rob Kass’ point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.Statisticsag389 Political Science, StatisticsArticlesMultiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box
http://academiccommons.columbia.edu/catalog/ac:154731
Su, Yu-Sung; Yajima, Masanao; Gelman, Andrew E.; Hill, Jenniferhttp://hdl.handle.net/10022/AC:P:15342Tue, 20 Nov 2012 00:00:00 +0000Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.Statisticsag389 Political Science, StatisticsArticlesR2WinBUGS: A Package for Running WinBUGS from R
http://academiccommons.columbia.edu/catalog/ac:154734
Sturtz, Sibylle; Ligges, Uwe; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:15341Tue, 20 Nov 2012 00:00:00 +0000The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has finished, it is possible either to read the resulting data into R by the package itself—which gives a compact graphical summary of inference and convergence diagnostics—or to use the facilities of the coda package for further analyses of the output. Examples are given to demonstrate the usage of this package.Statisticsag389 Political Science, StatisticsArticlesSegregation in Social Networks Based on Acquaintanceship and Trust
http://academiccommons.columbia.edu/catalog/ac:154740
DiPrete, Thomas A.; Gelman, Andrew E.; McCormick, Tyler; Teitler, Julien O.; Zheng, Tianhttp://hdl.handle.net/10022/AC:P:15339Tue, 20 Nov 2012 00:00:00 +0000Using 2006 General Social Survey data, the authors compare levels of segregation by race and along other dimensions of potential social cleavage in the contemporary United States. Americans are not as isolated as the most extreme recent estimates suggest. However, hopes that “bridging” social capital is more common in broader acquaintanceship networks than in core networks are not supported. Instead, the entire acquaintanceship network is perceived by Americans to be about as segregated as the much smaller network of close ties. People do not always know the religiosity, political ideology, family behaviors, or socioeconomic status of their acquaintances, but perceived social divisions on these dimensions are high, sometimes rivaling racial segregation in acquaintanceship networks. The major challenge to social integration today comes from the tendency of many Americans to isolate themselves from others who differ on race, political ideology, level of religiosity, and other salient aspects of social identity.Statisticstad61, ag389 , thm2105, jot8, tz33Political Science, Sociology, Statistics, Social WorkArticlesContributions to Semiparametric Inference to Biased-Sampled and Financial Data
http://academiccommons.columbia.edu/catalog/ac:177018
Sit, Tonyhttp://hdl.handle.net/10022/AC:P:14685Wed, 12 Sep 2012 00:00:00 +0000This thesis develops statistical models and methods for the analysis of life-time and financial data under the umbrella of semiparametric framework. The first part studies the use of empirical likelihood on Levy processes that are used to model the dynamics exhibited in the financial data. The second part is a study of inferential procedure for survival data collected under various biased sampling schemes in transformation and the accelerated failure time models. During the last decade Levy processes with jumps have received increasing popularity for modelling market behaviour for both derivative pricing and risk management purposes. Chan et al. (2009) introduced the use of empirical likelihood methods to estimate the parameters of various diffusion processes via their characteristic functions which are readily available in most cases. Return series from the market are used for estimation. In addition to the return series, there are many derivatives actively traded in the market whose prices also contain information about parameters of the underlying process. This observation motivates us to combine the return series and the associated derivative prices observed at the market so as to provide a more reflective estimation with respect to the market movement and achieve a gain in efficiency. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. We performed simulation and case studies to demonstrate the feasibility and effectiveness of the proposed method. The second part of this thesis investigates a unified estimation method for semiparametric linear transformation models and accelerated failure time model under general biased sampling schemes. The methodology proposed is first investigated in Paik (2009) in which the length-biased case is considered for transformation models. The new estimator is obtained from a set of counting process-based unbiased estimating equations, developed through introducing a general weighting scheme that offsets the sampling bias. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. A closed-form formula is derived for the limiting variance and the plug-in estimator is shown to be consistent. We demonstrate the unified approach through the special cases of left truncation, length-bias, the case-cohort design and variants thereof. Simulation studies and applications to real data sets are also presented.Statisticsts2500StatisticsDissertationsDetecting Dependence Change Points in Multivariate Time Series with Applications in Neuroscience and Finance
http://academiccommons.columbia.edu/catalog/ac:177012
Cribben, Ivor Johnhttp://hdl.handle.net/10022/AC:P:14681Wed, 12 Sep 2012 00:00:00 +0000In many applications there are dynamic changes in the dependency structure between multivariate time series. Two examples include neuroscience and finance. The second and third chapters focus on neuroscience and introduce a data-driven technique for partitioning a time course into distinct temporal intervals with different multivariate functional connectivity patterns between a set of brain regions of interest (ROIs). The technique, called Dynamic Connectivity Regression (DCR), detects temporal change points in functional connectivity and estimates a graph, or set of relationships between ROIs, for data in the temporal partition that falls between pairs of change points. Hence, DCR allows for estimation of both the time of change in connectivity and the connectivity graph for each partition, without requiring prior knowledge of the nature of the experimental design. Permutation and bootstrapping methods are used to perform inference on the change points. In the second chapter of this work, we focus on multi-subject data while in the third chapter, we concentrate on single-subject data and extend the DCR methodology in two ways: (i) we alter the algorithm to make it more accurate for individual subject data with a small number of observations and (ii) we perform inference on the edges or connections between brain regions in order to reduce the number of false positives in the graphs. We also discuss a Likelihood Ratio test to compare precision matrices (inverse covariance matrices) across subjects as well as a test across subjects on the single edges or partial correlations in the graph. In the final chapter of this work, we turn to a finance setting. We use the same DCR technique to detect changes in dependency structure in multivariate financial time series for situations where both the placement and number of change points is unknown. In this setting, DCR finds the dependence change points and estimates an undirected graph representing the relationship between time series within each interval created by pairs of adjacent change points. A shortcoming of the proposed DCR methodology is the presence of an excessive number of false positive edges in the undirected graphs, especially when the data deviates from normality. Here we address this shortcoming by proposing a procedure for performing inference on the edges, or partial dependencies between time series, that effectively removes false positive edges. We also discuss two robust estimation procedures based on ranks and the tlasso (Finegold and Drton, 2011) technique, which we contrast with the glasso technique used by DCR.Statisticsijc2104StatisticsDissertationsModeling Strategies for Large Dimensional Vector Autoregressions
http://academiccommons.columbia.edu/catalog/ac:152472
Zang, Pengfeihttp://hdl.handle.net/10022/AC:P:14666Tue, 11 Sep 2012 00:00:00 +0000The vector autoregressive (VAR) model has been widely used for describing the dynamic behavior of multivariate time series. However, fitting standard VAR models to large dimensional time series is challenging primarily due to the large number of parameters involved. In this thesis, we propose two strategies for fitting large dimensional VAR models. The first strategy involves reducing the number of non-zero entries in the autoregressive (AR) coefficient matrices and the second is a method to reduce the effective dimension of the white noise covariance matrix. We propose a 2-stage approach for fitting large dimensional VAR models where many of the AR coefficients are zero. The first stage provides initial selection of non-zero AR coefficients by taking advantage of the properties of partial spectral coherence (PSC) in conjunction with BIC. The second stage, based on $t$-ratios and BIC, further refines the spurious non-zero AR coefficients post first stage. Our simulation study suggests that the 2-stage approach outperforms Lasso-type methods in discovering sparsity patterns in AR coefficient matrices of VAR models. The performance of our 2-stage approach is also illustrated with three real data examples. Our second strategy for reducing the complexity of a large dimensional VAR model is based on a reduced-rank estimator for the white noise covariance matrix. We first derive the reduced-rank covariance estimator under the setting of independent observations and give the analytical form of its maximum likelihood estimate. Then we describe how to integrate the proposed reduced-rank estimator into the fitting of large dimensional VAR models, where we consider two scenarios that require different model fitting procedures. In the VAR modeling context, our reduced-rank covariance estimator not only provides interpretable descriptions of the dependence structure of VAR processes but also leads to improvement in model-fitting and forecasting over unrestricted covariance estimators. Two real data examples are presented to illustrate these fitting procedures.Statisticspz2146StatisticsDissertationsSome Models for Time Series of Counts
http://academiccommons.columbia.edu/catalog/ac:152149
Liu, Henghttp://hdl.handle.net/10022/AC:P:14561Wed, 29 Aug 2012 00:00:00 +0000This thesis focuses on developing nonlinear time series models and establishing relevant theory with a view towards applications in which the responses are integer valued. The discreteness of the observations, which is not appropriate with classical time series models, requires novel modeling strategies. The majority of the existing models for time series of counts assume that the observations follow a Poisson distribution conditional on an accompanying intensity process that drives the serial dynamics of the model. According to whether the evolution of the intensity process depends on the observations or solely on an external process, the models are classified into parameter-driven and observation-driven. Compared to the former one, an observation-driven model often allows for easier and more straightforward estimation of the model parameters. On the other hand, the stability properties of the process, such as the existence and uniqueness of a stationary and ergodic solution that are required for deriving asymptotic theory of the parameter estimates, can be quite complicated to establish, as compared to parameter-driven models. In this thesis, we first propose a broad class of observation-driven models that is based upon a one-parameter exponential family of distributions and incorporates nonlinear dynamics. The establishment of stability properties of these processes, which is at the heart of this thesis, is addressed by employing theory from iterated random functions and coupling techniques. Using this theory, we are also able to obtain the asymptotic behavior of maximum likelihood estimates of the parameters. Extensions of the base model in several directions are considered. Inspired by the idea of a self-excited threshold ARMA process, a threshold Poisson autoregression is proposed. It introduces a two-regime structure in the intensity process and essentially allows for modeling negatively correlated observations. E-chain, a non-standard Markov chain technique and Lyapunov's method are utilized to show the stationarity and a law of large numbers for this process. In addition, the model has been adapted to incorporate covariates, an important problem of practical and primary interest. The base model is also extended to consider the case of multivariate time series of counts. Given a suitable definition of a multivariate Poisson distribution, a multivariate Poisson autoregression process is described and its properties studied. Several simulation studies are presented to illustrate the inference theory. The proposed models are also applied to several real data sets, including the number of transactions of the Ericsson stock, the return times of Goldman Sachs Group stock prices, the number of road crashes in Schiphol, the frequencies of occurrences of gold particles, the incidences of polio in the US and the number of presentations of asthma in an Australian hospital. An array of graphical and quantitative diagnostic tools, which is specifically designed for the evaluation of goodness of fit for time series of counts models, is described and illustrated with these data sets.Statisticshl2494StatisticsDissertationsStatistical inference in two non-standard regression problems
http://academiccommons.columbia.edu/catalog/ac:151460
Seijo, Emilio Franciscohttp://hdl.handle.net/10022/AC:P:14317Wed, 08 Aug 2012 00:00:00 +0000This thesis analyzes two regression models in which their respective least squares estimators have nonstandard asymptotics. It is divided in an introduction and two parts. The introduction motivates the study of nonstandard problems and presents an outline of the contents of the remaining chapters. In part I, the least squares estimator of a multivariate convex regression function is studied in great detail. The main contribution here is a proof of the consistency of the aforementioned estimator in a completely nonparametric setting. Model misspecification, local rates of convergence and multidimensional regression models mixing convexity and componentwise monotonicity constraints will also be considered. Part II deals with change-point regression models and the issues that might arise when applying the bootstrap to these problems. The classical bootstrap is shown to be inconsistent on a simple change-point regression model, and an alternative (smoothed) bootstrap procedure is proposed and proved to be consistent. The superiority of the alternative method is also illustrated through a simulation study. In addition, a version of the continuous mapping theorem specially suited for change-point estimators is proved and used to derive the results concerning the bootstrap.Statistics, Applied mathematics, Mathematicsefs2113StatisticsDissertationsFast l1 Minimization for Genomewide Analysis of mRNA Lengths
http://academiccommons.columbia.edu/catalog/ac:140172
Drori, Iddo; Stodden, Victoria C.; Hurowitz, Evan H.Tue, 11 Oct 2011 00:00:00 +0000Application of the virtual northern method to human mRNA allows us to systematically measure transcript length on a genome-wide scale [1]. Characterization of RNA transcripts by length provides a measurement which complements cDNA sequencing. We have robustly extracted the lengths of the transcripts expressed by each gene for comparison with the Unigene, Refseq, and H-Invitational databases [2, 3]. Obtaining an accurate probability for each peak requires performing multiple bootstrap simulations, each involving a deconvolution operation which is equivalent to finding the sparsest non-negative solution of an underdetermined system of equations. This process is computationally intensive for a large number of simulations and genes. In this contribution we present an efficient approximation method which is faster than general purpose solvers by two orders of magnitude, and in practice reduces our processing time from a week to hours.Genetics, Statisticsvcs2115StatisticsArticlesMultiscale Representations for Manifold-Valued Data
http://academiccommons.columbia.edu/catalog/ac:140178
Rahman, Inam Ur; Drori, Iddo; Stodden, Victoria C.; Donoho, David L.; Schroeder, Peterhttp://hdl.handle.net/10022/AC:P:11434Tue, 11 Oct 2011 00:00:00 +0000We describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as: the sphere S2, the special orthogonal group SO(3), the positive definite matrices SPD(n), and the Grassmann manifolds G(n, k). The representations are based on the deployment of Deslauriers-Dubuc and Average Interpolating pyramids "in the tangent plane" of such manifolds, using the Exp and Log maps of those manifolds. The representations provide "wavelet coefficients" which can be thresholded, quantized, and scaled much as traditional wavelet coefficients. Tasks such as compression, noise removal, contrast enhancement, and stochastic simulation are facilitated by this representation. The approach applies to general manifolds, but is particularly suited to the manifolds we consider, i.e. Riemanian symmetric spaces, such as Sn−1, SO(n), G(n, k), where the Exp and Log maps are effectively computable. Applications to manifold-valued data sources of a geometric nature (motion, orientation, diffusion) seem particularly immediate. A software toolbox, SymmLab, can reproduce the results discussed in this paper.Statisticsvcs2115StatisticsArticlesBreakdown Point of Model Selection When the Number of Variables Exceeds the Number of Observations
http://academiccommons.columbia.edu/catalog/ac:140168
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11431Tue, 11 Oct 2011 00:00:00 +0000The classical multivariate linear regression problem assumes p variables X1, X2, ... , Xp and a response vector y, each with n observations, and a linear relationship between the two: y = X beta + z, where z ~ N(0, sigma2). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where pGtn. We find that 1) the breakdown point is well-de ned for random X-models and low noise, 2) increasing noise shifts the breakdown point to lower levels of sparsity, and reduces the model recovery ability of the algorithm in a systematic way, and 3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.Statisticsvcs2115StatisticsArticlesWhen Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
http://academiccommons.columbia.edu/catalog/ac:140175
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11433Tue, 11 Oct 2011 00:00:00 +0000We interpret non-negative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of "parts" and NMF correctly identifies the "parts". We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases.Statisticsvcs2115StatisticsArticlesSelf-controlled methods for postmarketing drug safety surveillance in large-scale longitudinal data
http://academiccommons.columbia.edu/catalog/ac:137551
Simpson, Shawn E.http://hdl.handle.net/10022/AC:P:10963Mon, 22 Aug 2011 00:00:00 +0000A primary objective in postmarketing drug safety surveillance is to ascertain the relationship between time-varying drug exposures and adverse events (AEs) related to health outcomes. Surveillance can be based on longitudinal observational databases (LODs), which contain time-stamped patient-level medical information including periods of drug exposure and dates of diagnoses. Due to its desirable properties, we focus on the self-controlled case series (SCCS) method for analysis in this context. SCCS implicitly controls for fixed multiplicative baseline covariates since each individual acts as their own control. In addition, only exposed cases are required for the analysis, which is computationally advantageous. In the first part of this work we present how the simple SCCS model can be applied to the surveillance problem, and compare the results of simple SCCS to those of existing methods. Many current surveillance methods are based on marginal associations between drug exposures and AEs. Such analyses ignore confounding drugs and interactions and have the potential to give misleading results. In order to avoid these difficulties, it is desirable for an analysis strategy to incorporate large numbers of time-varying potential confounders such as other drugs. In the second part of this work we propose the Bayesian multiple SCCS approach, which deals with high dimensionality and can provide a sparse solution via a Laplacian prior. We present details of the model and optimization procedure, as well as results of empirical investigations. SCCS is based on a conditional Poisson regression model, which assumes that events at different time points are conditionally independent given the covariate process. This requirement is problematic when the occurrence of an event can alter the future event risk. In a clinical setting, for example, patients who have a first myocardial infarction (MI) may be at higher subsequent risk for a second. In the third part of this work we propose the positive dependence self-controlled case series (PD-SCCS) method: a generalization of SCCS that allows the occurrence of an event to increase the future event risk, yet maintains the advantages of the original by controlling for fixed baseline covariates and relying solely on data from cases. We develop the model and compare the results of PD-SCCS and SCCS on example drug-AE pairs.Statisticsses2155StatisticsDissertationsSome Nonparametric Methods for Clinical Trials and High Dimensional Data
http://academiccommons.columbia.edu/catalog/ac:174242
Wu, Xiaoruhttp://hdl.handle.net/10022/AC:P:10335Wed, 11 May 2011 00:00:00 +0000This dissertation addresses two problems from novel perspectives. In chapter 2, I propose an empirical likelihood based method to nonparametrically adjust for baseline covariates in randomized clinical trials and in chapter 3, I develop a survival analysis framework for multivariate K-sample problems. (I): Covariate adjustment is an important tool in the analysis of randomized clinical trials and observational studies. It can be used to increase efficiency and thus power, and to reduce possible bias. While most statistical tests in randomized clinical trials are nonparametric in nature, approaches for covariate adjustment typically rely on specific regression models, such as the linear model for a continuous outcome, the logistic regression model for a dichotomous outcome, and the Cox model for survival time. Several recent efforts have focused on model-free covariate adjustment. This thesis makes use of the empirical likelihood method and proposes a nonparametric approach to covariate adjustment. A major advantage of the new approach is that it automatically utilizes covariate information in an optimal way without fitting a nonparametric regression. The usual asymptotic properties, including the Wilks-type result of convergence to a chi-square distribution for the empirical likelihood ratio based test, and asymptotic normality for the corresponding maximum empirical likelihood estimator, are established. It is also shown that the resulting test is asymptotically most powerful and that the estimator for the treatment effect achieves the semiparametric efficiency bound. The new method is applied to the Global Use of Strategies to Open Occluded Coronary Arteries (GUSTO)-I trial. Extensive simulations are conducted, validating the theoretical findings. This work is not only useful for nonparametric covariate adjustment but also has theoretical value. It broadens the scope of the traditional empirical likelihood inference by allowing the number of constraints to grow with the sample size. (II): Motivated by applications in high-dimensional settings, I propose a novel approach to testing equality of two or more populations by constructing a class of intensity centered score processes. The resulting tests are analogous in spirit to the well-known class of weighted log-rank statistics that is widely used in survival analysis. The test statistics are nonparametric, computationally simple and applicable to high-dimensional data. We establish the usual large sample properties by showing that the underlying log-rank score process converges weakly to a Gaussian random field with zero mean under the null hypothesis, and with a drift under the contiguous alternatives. For the Kolmogorov-Smirnov-type and the Cramer-von Mises-type statistics, we also establish the consistency result for any fixed alternative. As a practical means to obtain approximate cutoff points for the test statistics, a simulation based resampling method is proposed, with theoretical justification given by establishing weak convergence for the randomly weighted log-rank score process. The new approach is applied to a study of brain activation measured by functional magnetic resonance imaging when performing two linguistic tasks and also to a prostate cancer DNA microarray data set.Statisticsxw2144StatisticsDissertationsStatistical methods for indirectly observed network data
http://academiccommons.columbia.edu/catalog/ac:131447
McCormick, Tyler H.http://hdl.handle.net/10022/AC:P:10239Fri, 29 Apr 2011 00:00:00 +0000Social networks have become an increasingly common framework for understanding and explaining social phenomena. Yet, despite an abundance of sophisticated models, social network research has yet to realize its full potential, in part because of the difficulty of collecting social network data. In many cases, particularly in the social sciences, collecting complete network data is logistically and financially challenging. In contrast, Aggregated Relational Data (ARD) measure network structure indirectly by asking respondents how many connections they have with members of a certain subpopulation (e.g. How many individuals with HIV/AIDS do you know?). These data require no special sampling procedure and are easily incorporated into existing surveys. This research develops a latent space model for ARD. This dissertation proposes statistical methods for methods for estimating social network and population characteristics using one type of social network data collected using standard surveys. First, a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population is prosed. A second method estimates the demographic characteristics of hard-to-reach groups, or latent demographic profiles. These groups, such as those with HIV/AIDS, unlawful immigrants, or the homeless, are often excluded from the sampling frame of standard social science surveys. A third method develops a latent space model for ARD. This method is similar in spirit to previous latent space models for networks (see Hoff, Raftery and Handcock (2002), for example) in that the dependence structure of the network is represented parsimoniously in a multidimensional geometric space. The key distinction from the complete network case is that instead of conditioning on the (latent) distance between two members of the network, the latent space model for ARD conditions on the expected distance between a survey respondent and the center of a subpopulation in the latent space. A spherical latent space facilitates tractable computation of this expectation. This model estimates relative homogeneity between groups in the population and variation in the propensity for interaction between respondents and group members.Statisticsthm2105StatisticsDissertationsContagion and Systemic Risk in Financial Networks
http://academiccommons.columbia.edu/catalog/ac:131474
Moussa, Amalhttp://hdl.handle.net/10022/AC:P:10249Fri, 29 Apr 2011 00:00:00 +0000The 2007-2009 financial crisis has shed light on the importance of contagion and systemic risk, and revealed the lack of adequate indicators for measuring and monitoring them. This dissertation addresses these issues and leads to several recommendations for the design of an improved assessment of systemic importance, improved rating methods for structured finance securities, and their use by investors and risk managers. Using a complete data set of all mutual exposures and capital levels of financial institutions in Brazil in 2007 and 2008, we explore in chapter 2 the structure and dynamics of the Brazilian financial system. We show that the Brazilian financial system exhibits a complex network structure characterized by a strong degree of heterogeneity in connectivity and exposure sizes across institutions, which is qualitatively and quantitatively similar to the statistical features observed in other financial systems. We find that the Brazilian financial network is well represented by a directed scale-free network, rather than a small world network. Based on these observations, we propose a stochastic model for the structure of banking networks, representing them as a directed weighted scale free network with power law distributions for in-degree and out-degree of nodes, Pareto distribution for exposures. This model may then be used for simulation studies of contagion and systemic risk in networks. We propose in chapter 3 a quantitative methodology for assessing contagion and systemic risk in a network of interlinked institutions. We introduce the Contagion Index as a metric of the systemic importance of a single institution or a set of institutions, that combines the effects of both common market shocks to portfolios and contagion through counterparty exposures. Using a directed scale-free graph simulation of the financial system, we study the sensitivity of contagion to a change in aggregate network parameters: connectivity, concentration of exposures, heterogeneity in degree distribution and network size. More concentrated and more heterogeneous networks are found to be more resilient to contagion. The impact of connectivity is more controversial: in well-capitalized networks, increasing connectivity improves the resilience to contagion when the initial level of connectivity is high, but increases contagion when the initial level of connectivity is low. In undercapitalized networks, increasing connectivity tends to increase the severity of contagion. We also study the sensitivity of contagion to local measures of connectivity and concentration across counterparties --the counterparty susceptibility and local network frailty-- that are found to have a monotonically increasing relationship with the systemic risk of an institution. Requiring a minimum (aggregate) capital ratio is shown to reduce the systemic impact of defaults of large institutions; we show that the same effect may be achieved with less capital by imposing such capital requirements only on systemically important institutions and those exposed to them. In chapter 4, we apply this methodology to the study of the Brazilian financial system. Using the Contagion Index, we study the potential for default contagion and systemic risk in the Brazilian system and analyze the contribution of balance sheet size and network structure to systemic risk. Our study reveals that, aside from balance sheet size, the network-based local measures of connectivity and concentration of exposures across counterparties introduced in chapter 3, the counterparty susceptibility and local network frailty, contribute significantly to the systemic importance of an institution in the Brazilian network. Thus, imposing an upper bound on these variables could help reducing contagion. We examine the impact of various capital requirements on the extent of contagion in the Brazilian financial system, and show that targeted capital requirements achieve the same reduction in systemic risk with lower requirements in capital for financial institutions. The methodology we proposed in chapter 3 for estimating contagion and systemic risk requires visibility on the entire network structure. Reconstructing bilateral exposures from balance sheets data is then a question of interest in a financial system where bilateral exposures are not disclosed. We propose in chapter 5 two methods to derive a distribution of bilateral exposures matrices. The first method attempts to recover the balance sheet assets and liabilities "sample by sample". Each sample of the bilateral exposures matrix is solution of a relative entropy minimization problem subject to the balance sheet constraints. However, a solution to this problem does not always exist when dealing with sparse sample matrices. Thus, we propose a second method that attempts to recover the assets and liabilities "in the mean". This approach is the analogue of the Weighted Monte Carlo method introduced by Avellaneda et al. (2001). We first simulate independent samples of the bilateral exposures matrix from a relevant prior distribution on the network structure, then we compute posterior probabilities by maximizing the entropy under the constraints that the balance sheet assets and liabilities are recovered in the mean. We discuss the pros and cons of each approach and explain how it could be used to detect systemically important institutions in the financial system. The recent crisis has also raised many questions regarding the meaning of structured finance credit ratings issued by rating agencies and the methodology behind them. Chapter 6 aims at clarifying some misconceptions related to structured finance ratings and how they are commonly interpreted: we discuss the comparability of structured finance ratings with bond ratings, the interaction between the rating procedure and the tranching procedure and its consequences for the stability of structured finance ratings in time. These insights are illustrated in a factor model by simulating rating transitions for CDO tranches using a nested Monte Carlo method. In particular, we show that the downgrade risk of a CDO tranche can be quite different from a bond with same initial rating. Structured finance ratings follow path-dependent dynamics that cannot be adequately described, as usually done, by a matrix of transition probabilities. Therefore, a simple labeling via default probability or expected loss does not discriminate sufficiently their downgrade risk. We propose to supplement ratings with indicators of downgrade risk. To overcome some of the drawbacks of existing rating methods, we suggest a risk-based rating procedure for structured products. Finally, we formulate a series of recommendations regarding the use of credit ratings for CDOs and other structured credit instruments.Finance, Statisticsam2810Industrial Engineering and Operations Research, StatisticsDissertationsWhy we (usually) don't have to worry about multiple comparison
http://academiccommons.columbia.edu/catalog/ac:129500
Gelman, Andrew E.; Hill, Jennifer; Yajima, Masanaohttp://hdl.handle.net/10022/AC:P:9795Wed, 12 Jan 2011 00:00:00 +0000Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise. Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p-values corresponding to intervals of fixed width). Thus, multilevel models address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.Statisticsag389Political Science, Statistics, Columbia Population Research CenterWorking papersComment: Bayesian Checking of the Second Levels of Hierarchical Models
http://academiccommons.columbia.edu/catalog/ac:125303
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8570Wed, 17 Mar 2010 00:00:00 +0000Bayarri and Castellanos (BC) have written an interesting paper discussing two forms of posterior model check, one based on cross-validation and one based on replication of new groups in a hierarchical model. We think both these checks are good ideas and can become even more effective when understood in the context of posterior predictive checking. For the purpose of discussion, however, it is most interesting to focus on the areas where we disagree with BC.Statisticsag389Political Science, StatisticsArticlesStruggles with survey weighting and regression modeling
http://academiccommons.columbia.edu/catalog/ac:125309
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8572Wed, 17 Mar 2010 00:00:00 +0000The general principles of Bayesian data analysis imply that models for survey responses should be constructed conditional on all variables that affect the probability of inclusion and nonresponse, which are also the variables used in survey weighting and clustering. However, such models can quickly become very complicated, with potentially thousands of poststratification cells. It is then a challenge to develop general families of multilevel probability models that yield reasonable Bayesian inferences. We discuss in the context of several ongoing public health and social surveys. This work is currently open-ended, and we conclude with thoughts on how research could proceed to solve these problems.Statisticsag389Political Science, StatisticsArticlesRejoinder: Struggles with survey weighting and regression modeling
http://academiccommons.columbia.edu/catalog/ac:125312
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8573Wed, 17 Mar 2010 00:00:00 +0000I was motivated to write this paper, with its controversial opening line, "Survey weighting is a mess," from various experiences as an applied statistician.Statisticsag389Political Science, StatisticsArticlesBayesian hierarchical classes analysis
http://academiccommons.columbia.edu/catalog/ac:125300
Leenen, Iwin; Mechelen, Iven van; Gelman, Andrew E.; Knop, Stijn dehttp://hdl.handle.net/10022/AC:P:8569Wed, 17 Mar 2010 00:00:00 +0000Hierarchical classes models are models for N-way N-mode data that represent the association among the N modes and simultaneously yield, for each mode, a hierarchical classification of its elements. In this paper we present a stochastic extension of the hierarchical classes model for two-way two-mode binary data. In line with the original model, the new probabilistic extension still represents both the association among the two modes and the hierarchical classifications. A fully Bayesian method for fitting the new model is presented and evaluated in a simulation study. Furthermore, we propose tools for model selection and model checking based on Bayes factors and posterior predictive checks. We illustrate the advantages of the new approach with applications in the domain of the psychology of choice and psychiatric diagnosis.Statisticsag389Political Science, StatisticsArticlesRich state, poor state, red state, blue state: What's the matter with Connecticut?
http://academiccommons.columbia.edu/catalog/ac:125297
Gelman, Andrew E.; Shor, Boris; Bafumi, Joseph; Park, David K.http://hdl.handle.net/10022/AC:P:8568Wed, 17 Mar 2010 00:00:00 +0000For decades, the Democrats have been viewed as the party of the poor, with the Republicans representing the rich. Recent presidential elections, however, have shown a reverse pattern, with Democrats performing well in the richer blue states in the northeast and coasts, and Republicans dominating in the red states in the middle of the country and the south. Through multilevel modeling of individual-level survey data and county- and state-level demographic and electoral data, we reconcile these patterns. Furthermore, we find that income matters more in red America than in blue America. In poor states, rich people are much more likely than poor people to vote for the Republican presidential candidate, but in rich states (such as Connecticut), income has a very low correlation with vote preference.Political science, Statisticsag389Political Science, StatisticsArticlesBayes: Radical, liberal, or conservative?
http://academiccommons.columbia.edu/catalog/ac:125306
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8571Wed, 17 Mar 2010 00:00:00 +0000Statisticsag389Political Science, StatisticsArticlesBayes, Jeffreys, Prior Distributions and the Philosophy of Statistics
http://academiccommons.columbia.edu/catalog/ac:125279
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8563Mon, 15 Mar 2010 00:00:00 +0000I actually own a copy of Harold Jeffreys's Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin and Rousseau as a platform for further discussion of foundational issues. In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys's principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys's preference for simplicity; and (3) a key generalization of Jeffreys's ideas is to explicitly include model checking in the process of data analysis.Statisticsag389Political Science, StatisticsArticlesPartisans without constraint: Political polarization and trends in American public opinion
http://academiccommons.columbia.edu/catalog/ac:125291
Baldassarri, Delia; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8566Mon, 15 Mar 2010 00:00:00 +0000Public opinion polarization is here conceived as a process of alignment along multiple lines of potential disagreement and measured as growing constraint in individuals' preferences. Using NES data from 1972 to 2004, the authors model trends in issue partisanship--the correlation of issue attitudes with party identification--and issue alignment--the correlation between pairs of issues--and find a substantive increase in issue partisanship, but little evidence of issue alignment. The findings suggest that opinion changes correspond more to a resorting of party labels among voters than to greater constraint on issue attitudes: since parties are more polarized, they are now better at sorting individuals along ideological lines. Levels of constraint vary across population subgroups: strong partisans and wealthier and politically sophisticated voters have grown more coherent in their beliefs. The authors discuss the consequences of partisan realignment and group sorting on the political process and potential deviations from the classic pluralistic account of American politics.Political science, Statisticsag389Political Science, StatisticsArticlesPredicting and dissecting the seats-votes curve in the 2006 U.S. House election
http://academiccommons.columbia.edu/catalog/ac:125294
Kastellec, Jonathan P.; Gelman, Andrew E.; Chandler, Jamie P.http://hdl.handle.net/10022/AC:P:8567Mon, 15 Mar 2010 00:00:00 +0000The 2008 U.S. House elections mark the first time since 1994 that the Democrats will seek to retain a majority. With the political climate favoring Democrats this year, it seems almost certain that the party will retain control, and will likely increase its share of seats. In five national polls taken in June of this year, Democrats enjoyed on average a 13-point advantage in the generic congressional ballot; as Bafumi, Erikson, and Wlezien (2007) point out, these early polls, suitably adjusted, are good predictors of the November vote. As of late July, bettors at intrade.com put the probability of the Democrats retaining a majority at about 95% (Intrade.com 2008). Elsewhere in this symposium, Klarner (2008) predicts an 11-seat gain for the Democrats, while Lockerbie (2008) forecasts a 25-seat pickup. In this paper we document how the electoral playing field has shifted from a Republican advantage between 1996 and 2004 to a Democratic tilt today. In an earlier article (Kastellec, Gelman, and Chandler 2008), we predicted the seats-votes curve in the 2006 election, showing how the Democrats faced an uphill battle in their effort to take control of the House and, their victory notwithstanding, ended up winning a lower percentage of seats than their average district vote nationwide. We follow up on this analysis by using the same method to predict the seats-votes curve in 2008. Due to the shift in incumbency advantage from the Republicans to the Democrats, compounded by a greater number of retirements among Republican members, we show that the Democrats now enjoy a partisan bias, and can expect to win more seats than votes for the first time since 1992. While this bias is not as large as the advantage the Republicans held in 2006, it will likely help the Democrats increase their share of seats.Statistics, Political sciencejpk2004, ag389Political Science, StatisticsArticlesThe playing field shifts: Predicting the seats-votes curve in the 2008 U.S. House election
http://academiccommons.columbia.edu/catalog/ac:125285
Kastellec, Jonathan P.; Gelman, Andrew E.; Chandler, Jamie P.http://hdl.handle.net/10022/AC:P:8564Mon, 15 Mar 2010 00:00:00 +0000The 2008 U.S. House elections mark the first time since 1994 that the Democrats will seek to retain a majority. With the political climate favoring Democrats this year, it seems almost certain that the party will retain control, and will likely increase its share of seats. In five national polls taken in June of this year, Democrats enjoyed on average a 13-point advantage in the generic congressional ballot; as Bafumi, Erikson, and Wlezien (2007) point out, these early polls, suitably adjusted, are good predictors of the November vote. As of late July, bettors at intrade.com put the probability of the Democrats retaining a majority at about 95% (Intrade.com 2008). Elsewhere in this symposium, Klarner (2008) predicts an 11-seat gain for the Democrats, while Lockerbie (2008) forecasts a 25-seat pickup. In this paper we document how the electoral playing field has shifted from a Republican advantage between 1996 and 2004 to a Democratic tilt today. In an earlier article (Kastellec, Gelman, and Chandler 2008), we predicted the seats-votes curve in the 2006 election, showing how the Democrats faced an uphill battle in their effort to take control of the House and, their victory notwithstanding, ended up winning a lower percentage of seats than their average district vote nationwide. We follow up on this analysis by using the same method to predict the seats-votes curve in 2008. Due to the shift in incumbency advantage from the Republicans to the Democrats, compounded by a greater number of retirements among Republican members, we show that the Democrats now enjoy a partisan bias, and can expect to win more seats than votes for the first time since 1992. While this bias is not as large as the advantage the Republicans held in 2006, it will likely help the Democrats increase their share of seats.Political science, Statisticsjpk2004, ag389Political Science, StatisticsArticlesDiscussion of the Article "Website Morphing"
http://academiccommons.columbia.edu/catalog/ac:125288
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8565Mon, 15 Mar 2010 00:00:00 +0000The article under discussion illustrates the trade-off between optimization and exploration that is fundamental to statistical experimental design. In this discussion, I suggest that the research under discussion could be made even more effective by checking the fit of the model by comparing observed data to replicated data sets simulated from the fitted model.Statisticsag389Political Science, StatisticsArticlesWhy we (usually) don't have to worry about multiple comparisons
http://academiccommons.columbia.edu/catalog/ac:125225
Gelman, Andrew E.; Hill, Jennifer; Yajima, Masanaohttp://hdl.handle.net/10022/AC:P:8550Fri, 12 Mar 2010 00:00:00 +0000Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise. Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p-values corresponding to intervals of fixed width). Thus, multilevel models address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.Statisticsag389Political Science, StatisticsArticlesThoughts on new statistical procedures for age-period-cohort analyses
http://academiccommons.columbia.edu/catalog/ac:125234
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8553Fri, 12 Mar 2010 00:00:00 +0000Statisticsag389Political Science, StatisticsArticlesGoing beyond the book: Toward critical reading in statistics teaching
http://academiccommons.columbia.edu/catalog/ac:125240
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8555Fri, 12 Mar 2010 00:00:00 +0000We can improve our teaching of statistical examples from books by collecting further data, reading cited articles, and performing further data analysis. This should not come as a surprise, but what might be new is the realization of how close to the surface these research opportunities are: even influential and celebrated books can have examples where more can be learned with a small amount of additional effort. We discuss three examples that have arisen in our own teaching: an introductory textbook that motivated us to think more carefully about categorical and continuous variables; a book for the lay reader that misreported a study of menstruation and accidents; and a monograph on the foundations of probability that overinterpreted statistically insignificant fluctuations in sex ratios.Political science, Statisticsag389Political Science, StatisticsOne vote, many Mexicos: Income and vote choice in the 1994, 2000, and 2006 presidential elections
http://academiccommons.columbia.edu/catalog/ac:125237
Cortina, Jeronimo; Gelman, Andrew E.; Lasala Blanco, Maria Narayanihttp://hdl.handle.net/10022/AC:P:8554Fri, 12 Mar 2010 00:00:00 +0000Using multilevel modeling of state-level economic data and individual-level exit poll data from the 1994, 2000 and 2006 Mexican presidential elections, we find that income has a stronger effect in predicting the vote for the conservative party in poorer states than in richer states -- a pattern that has also been found in recent U.S. elections. In addition (and unlike in the U.S.), richer states on average tend to support the conservative party at higher rates than poorer states. Our findings raise questions regarding the role that income polarization and region play in vote choice. The electoral results since 1994 reveal that collapsing multiple states into large regions entails significant loss of information that otherwise may uncover sharper and quiet revealing differences in voting patterns between rich and poor states as well as rich and poor individuals within states.Political science, Statisticsag389, ml2362Political Science, StatisticsArticlesBayesian Combination of State Polls and Election Forecasts
http://academiccommons.columbia.edu/catalog/ac:125228
Lock, Kari; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8551Fri, 12 Mar 2010 00:00:00 +0000A wide range of potentially useful data are available for election forecasting: the results of previous elections, a multitude of pre-election polls, and predictors such as measures of national and statewide economic performance. How accurate are different forecasts? We estimate predictive uncertainty via analysis of data collected from past elections (actual outcomes, pre-election polls, and model estimates). With these estimated uncertainties, we use Bayesian inference to integrate the various sources of data to form posterior distributions for the state and national two-party Democratic vote shares for the 2008 election. Our key idea is to separately forecast the national popular vote shares and the relative positions of the states. More generally, such an approach could be applied to study changes in public opinion and other phenomena with wide national swings and fairly stable spatial distributions relative to the national average.Political science, Statisticsag389Political Science, StatisticsArticlesFitting Multilevel Models When Predictors and Group Effects Correlate
http://academiccommons.columbia.edu/catalog/ac:125243
Bafumi, Joseph; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8556Fri, 12 Mar 2010 00:00:00 +0000Random effects models (that is, regressions with varying intercepts that are modeled with error) are avoided by some social scientists because of potential issues with bias and uncertainty estimates. Particularly, when one or more predictors correlate with the group or unit effects, a key Gauss-Markov assumption is violated and estimates are compromised. However, this problem can easily be solved by including the average of each individual-level predictors in the group-level regression. We explain the solution, demonstrate its effectiveness using simulations, show how it can be applied in some commonly-used statistical software, and discuss its potential for substantive modeling.Statisticsag389Political Science, StatisticsArticlesWhat will we know on Tuesday at 7pm?
http://academiccommons.columbia.edu/catalog/ac:125231
Gelman, Andrew E.; Silver, Natehttp://hdl.handle.net/10022/AC:P:8552Fri, 12 Mar 2010 00:00:00 +0000Political science, Statisticsag389Political Science, StatisticsArticlesWhy we (usually) don't have to worry about multiple comparisons
http://academiccommons.columbia.edu/catalog/ac:125258
Gelman, Andrew E.; Hill, Jennifer; Yajima, Masanaohttp://hdl.handle.net/10022/AC:P:8561Fri, 12 Mar 2010 00:00:00 +0000Statisticsag389Political Science, StatisticsPresentationsFully Bayesian computing
http://academiccommons.columbia.edu/catalog/ac:125246
Kerman, Jouni; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8557Fri, 12 Mar 2010 00:00:00 +0000A fully Bayesian computing environment calls for the possibility of defining vector and array objects that may contain both random and deterministic quantities, and syntax rules that allow treating these objects much like any variables or numeric arrays. Working within the statistical package R, we introduce a new object-oriented framework based on a new random variable data type that is implicitly represented by simulations. We seek to be able to manipulate random variables and posterior simulation objects conveniently and transparently and provide a basis for further development of methods and functions that can access these objects directly. We illustrate the use of this new programming environment with several examples of Bayesian computing, including posterior predictive checking and the manipulation of posterior simulations. This new environment is fully Bayesian in that the posterior simulations can be handled directly as random variables.Computer science, Statisticsag389Political Science, StatisticsArticlesSampling for Bayesian computation with large datasets
http://academiccommons.columbia.edu/catalog/ac:125252
Huang, Zaiying; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:8559Fri, 12 Mar 2010 00:00:00 +0000Multilevel models are extremely useful in handling large hierarchical datasets. However, computation can be a challenge, both in storage and CPU time per iteration of Gibbs sampler or other Markov chain Monte Carlo algorithms. We propose a computational strategy based on sampling the data, computing separate posterior distributions based on each sample, and then combining these to get a consensus posterior inference. With hierarchical data structures, we perform cluster sampling into subsets with the same structures as the original data. This reduces the number of parameters as well as sample size for each separate model fit. We illustrate with examples from climate modeling and newspaper marketing.Statisticsag389Political Science, StatisticsArticlesWhy we (usually) don't have to worry about multiple comparisons
http://academiccommons.columbia.edu/catalog/ac:125255
Gelman, Andrew E.; Hill, Jennifer; Yajima, Masanaohttp://hdl.handle.net/10022/AC:P:8560Fri, 12 Mar 2010 00:00:00 +0000Statisticsag389Political Science, StatisticsPresentationsProtecting minorities in binary elections: A test of storable votes using field data
http://academiccommons.columbia.edu/catalog/ac:125276
Casella, Alessandra M.; Ehrenberg, Shuky; Gelman, Andrew E.; Shen, Jiehttp://hdl.handle.net/10022/AC:P:8562Fri, 12 Mar 2010 00:00:00 +0000Democratic systems are built, with good reason, on majoritarian principles, but their legitimacy requires the protection of strongly held minority preferences. The challenge is to do so while treating every voter equally and preserving aggregate welfare. One possible solution is storable votes: granting each voter a budget of votes to cast as desired over multiple decisions. During the 2006 student elections at Columbia University, we tested a simple version of this idea: voters were asked to rank the importance of the different contests and to choose where to cast a single extra "bonus vote," had one been available. We used these responses to construct distributions of intensities and electoral outcomes, both without and with the bonus vote. Bootstrapping techniques provided estimates of the probable impact of the bonus vote. The bonus vote performs well: when minority preferences are particularly intense, the minority wins at least one of the contests with 15-30 percent probability; and, when the minority wins, aggregate welfare increases with 85-95 percent probability. When majority and minority preferences are equally intense, the effect of the bonus vote is smaller and more variable but on balance still positive.Political science, Statisticsac186, ag389Political Science, Statistics, EconomicsWorking papers