Academic Commons Search Results
https://academiccommons.columbia.edu/catalog?action=index&controller=catalog&f%5Bsubject_facet%5D%5B%5D=Statistics&format=rss&fq%5B%5D=has_model_ssim%3A%22info%3Afedora%2Fldpd%3AContentAggregator%22&q=&rows=500&sort=record_creation_date+desc
Academic Commons Search Results (en-us)

Topics in Computational Bayesian Statistics With Applications to Hierarchical Models in Astronomy and Sociology
https://academiccommons.columbia.edu/catalog/ac:8kprr4xgzr
Sahai, Swupnil | 10.7916/D83R15HQ | Thu, 09 Nov 2017 23:12:54 +0000
This thesis includes three parts. The overarching theme is how to analyze structured hierarchical data, with applications to astronomy and sociology. The first part discusses how expectation propagation can be used to parallelize the computation when fitting big hierarchical Bayesian models. This methodology is then used to fit a novel, nonlinear mixture model to ultraviolet radiation from various regions of the observable universe. The second part discusses how the Stan probabilistic programming language can be used to numerically integrate terms in a hierarchical Bayesian model. This technique is demonstrated on supernova data to significantly speed up convergence to the posterior distribution compared to a previous study that used a Gibbs-type sampler. The third part builds a formal latent kernel representation for aggregate relational data as a way to more robustly estimate the mixing characteristics of agents in a network. In particular, the framework is applied to sociology surveys to estimate, as a function of ego age, the age and sex composition of the personal networks of individuals in the United States.
Statistics, Astronomy, Sociology, Bayesian statistical decision theory, Multilevel models (Statistics) | sks2196 | Statistics | Theses

Expansion of a filtration with a stochastic process: a high frequency trading perspective
https://academiccommons.columbia.edu/catalog/ac:tb2rbnzs9t
Neufcourt, Léo | 10.7916/D8571QKP | Fri, 13 Oct 2017 22:18:55 +0000
A theory of expansion of filtrations has been developed since the 1970s to model dynamic probabilistic problems with asymmetric information. It has found a special echo in mathematical finance around the concept of insider trading, which in return has proved very convenient for expressing the abstract properties of augmentations of filtrations. Research has historically focused on two particular classes of expansions, initial and progressive expansions, corresponding to additional information generated respectively by a random variable and a random time. Although they can reproduce some stylized facts in the insider trading paradigm, those two types of expansions are too restrictive to model quantitatively dynamic phenomena of contemporary interest such as high-frequency trading. In order to model such a continuous flow of information, Kchia and Protter (2015) introduce augmentations of filtrations where the additional information is generated by a stochastic process.
This thesis complements the pioneering work of Kchia and Protter (2015) with an analysis of the information drift appearing in the transformation of semimartingales, which leads to a quantitative valuation of the additional information. In the preliminary chapters we introduce the general framework of expansions of filtrations and present the information drift as a key proxy for the value of information, by characterizing its existence as a no-arbitrage condition and expressing the value increase of optimization problems associated with additional information as one of its integrals. The theoretical core of this thesis is formed by two series of convergence theorems for semimartingales and their information drifts under a new topology on filtrations, from which we derive the transformation of semimartingales when the filtration is augmented with a stochastic process, as well as a computational method to estimate the information drift. We finally study several dynamical examples of anticipative expansions of a Brownian filtration with stochastic processes, where the information drift does or does not exist, and set the foundations for an ongoing application to estimating the advantage of high-frequency traders on the general market.
Statistics, Mathematics, Finance | ln2294 | Statistics | Theses

Essays on Matching and Weighting for Causal Inference in Observational Studies
https://academiccommons.columbia.edu/catalog/ac:k3j9kd51dt
Resa Juárez, María de los Angeles | 10.7916/D8959W4H | Fri, 13 Oct 2017 22:16:12 +0000
This thesis consists of three papers on matching and weighting methods for causal inference. The first paper conducts a Monte Carlo simulation study to evaluate the performance of multivariate matching methods that select a subset of treatment and control observations. The matching methods studied are the widely used nearest neighbor matching with propensity score calipers, and the more recently proposed methods, optimal matching of an optimally chosen subset and optimal cardinality matching. The main findings are: (i) covariate balance, as measured by differences in means, variance ratios, Kolmogorov-Smirnov distances, and cross-match test statistics, is better with cardinality matching since by construction it satisfies balance requirements; (ii) for given levels of covariate balance, the matched samples are larger with cardinality matching than with the other methods; (iii) in terms of covariate distances, optimal subset matching performs best; (iv) treatment effect estimates from cardinality matching have lower RMSEs, provided strong requirements for balance, specifically, fine balance, or strength-k balance, plus close mean balance. In standard practice, a matched sample is considered to be balanced if the absolute differences in means of the covariates across treatment groups are smaller than 0.1 standard deviations. However, the simulation results suggest that stronger forms of balance should be pursued in order to remove systematic biases due to observed covariates when a difference in means treatment effect estimator is used. In particular, if the true outcome model is additive then marginal distributions should be balanced, and if the true outcome model is additive with interactions then low-dimensional joints should be balanced.
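The 0.1-standard-deviation balance rule described above is simple to state in code. The following is a generic illustration in Python, not code from the thesis; the function names and the pooled-standard-deviation convention are our own assumptions:

```python
import numpy as np

def standardized_mean_differences(X_treat, X_ctrl):
    """Absolute difference in covariate means across groups, scaled by the
    pooled standard deviation (a common balance diagnostic)."""
    X_treat = np.asarray(X_treat, dtype=float)
    X_ctrl = np.asarray(X_ctrl, dtype=float)
    diff = X_treat.mean(axis=0) - X_ctrl.mean(axis=0)
    pooled_sd = np.sqrt(
        (X_treat.var(axis=0, ddof=1) + X_ctrl.var(axis=0, ddof=1)) / 2.0)
    return np.abs(diff) / pooled_sd

def is_balanced(X_treat, X_ctrl, threshold=0.1):
    """Standard practice: declare balance if every covariate's standardized
    mean difference is below 0.1 standard deviations."""
    return bool(np.all(standardized_mean_differences(X_treat, X_ctrl) < threshold))
```

Cardinality matching, by contrast, imposes balance constraints of this kind directly inside the optimization that selects the matched sample, rather than checking them after the fact.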
The second paper focuses on longitudinal studies, where marginal structural models (MSMs) are widely used to estimate the effect of time-dependent treatments in the presence of time-dependent confounders. Under a sequential ignorability assumption, MSMs yield unbiased treatment effect estimates by weighting each observation by the inverse of the probability of its observed treatment sequence given its history of observed covariates. However, these probabilities are typically estimated by fitting a propensity score model, and the resulting weights can fail to adjust for observed covariates due to model misspecification. Also, these weights tend to yield very unstable estimates if the predicted probabilities of treatment are very close to zero, which is often the case in practice. To address both of these problems, instead of modeling the probabilities of treatment, a design-based approach is taken and weights of minimum variance that adjust for the covariates across all possible treatment histories are found directly. For this, the role of weighting in longitudinal studies of treatment effects is analyzed, and a convex optimization problem that can be solved efficiently is defined. Unlike standard methods, this approach makes evident to the investigator the limitations imposed by the data when estimating causal effects without extrapolating. A simulation study shows that this approach outperforms standard methods, providing less biased and more precise estimates of time-varying treatment effects in a variety of settings. The proposed method is used on Chilean educational data to estimate the cumulative effect of attending a private subsidized school, as opposed to a public school, on students’ university admission test scores.
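For contrast with the design-based weights proposed in the paper, the standard inverse-probability weighting for a longitudinal study can be sketched in a few lines (an illustrative Python fragment under our own naming, assuming the per-period treatment probabilities have already been fitted):

```python
import numpy as np

def ip_weights(prob_treated, treated):
    """Unstabilized inverse-probability-of-treatment weights for a
    longitudinal study: the product over time of 1 / P(A_t = a_t | history).

    prob_treated: (n_subjects, n_periods) fitted P(A_t = 1 | history)
    treated:      (n_subjects, n_periods) observed 0/1 treatment sequences
    """
    prob_treated = np.asarray(prob_treated, dtype=float)
    treated = np.asarray(treated, dtype=int)
    # Probability of the treatment actually received at each period
    p_obs = np.where(treated == 1, prob_treated, 1.0 - prob_treated)
    return 1.0 / p_obs.prod(axis=1)
```

Note how a single near-zero entry in `p_obs` blows up a subject's weight, which is exactly the instability that motivates replacing the fitted probabilities with minimum-variance balancing weights.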
The third paper is centered on observational studies with multi-valued treatments. Generalizing methods for matching and stratifying to accommodate multi-valued treatments has proven to be a complex task. A natural way to address confounding in this case is by weighting the observations, typically by the inverse probability of treatment weights (IPTW). As in the MSMs case, these weights can be highly variable and produce unstable estimates due to extreme weights. In addition, model misspecification, small sample sizes, and truncation of extreme weights can cause the weights to fail to adjust appropriately for observed confounders. The conditions the weights need to satisfy in order to provide close to unbiased treatment effect estimates with reduced variability are determined, and the convex optimization problem that can be solved in polynomial time to obtain them is defined. A simulation study with different settings is conducted to compare the proposed weighting scheme to IPTW, including generalized propensity score estimation methods that also consider explicitly the covariate balance problem in the probability estimation process. The applicability of the methods to continuous treatments is also tested. The results show that directly targeting balance with the weights, instead of focusing on estimating treatment assignment probabilities, provides the best results in terms of bias and root mean square error of the treatment effect estimator. The effects of the intensity level of the 2010 Chilean earthquake on posttraumatic stress disorder are estimated using the proposed methodology.
Statistics, Inference, Statistical matching, Probabilities | mdr2146 | Statistics | Theses

Efficient Estimation of the Expectation of a Latent Variable in the Presence of Subject-Specific Ancillaries
https://academiccommons.columbia.edu/catalog/ac:cz8w9ghx4p
Mittel, Louis Buchalter | 10.7916/D8JW8SFB | Fri, 13 Oct 2017 16:18:26 +0000
Latent variables are often included in a model in order to capture the diversity among subjects in a population. Sometimes the distribution of these latent variables is of principal interest. In studies where sequences of observations are taken from subjects, ancillary variables, such as the number of observations provided by each subject, usually also vary between subjects. The goal here is to understand efficient estimation of the expectation of the latent variable in the presence of these subject-specific ancillaries.
Unbiased estimation and efficient estimation of the expectation of the latent parameter depend on the dependence structure of these three subject-specific components: latent variable, sequence of observations, and ancillary. This dissertation considers estimation under two dependence configurations. In Chapter 3, efficiency is studied under the model in which no assumptions are made about the joint distribution of the latent variable and the subject-specific ancillary. Chapter 4 treats the setting where the ancillary variable and the latent variable are independent.
Statistics, Latent variables, Estimation theory | lbm2126 | Statistics | Theses

Marginal Screening on Survival Data
https://academiccommons.columbia.edu/catalog/ac:nvx0k6djk7
Huang, Tzu Jung | 10.7916/D85H7TSN | Mon, 09 Oct 2017 19:16:58 +0000
This work develops a marginal screening test to detect the presence of significant predictors for a right-censored time-to-event outcome under a high-dimensional accelerated failure time (AFT) model. Establishing a rigorous screening test in this setting is challenging, not only because of the right censoring, but also due to the post-selection inference. The oracle property in such situations fails to ensure adequate control of the family-wise error rate, and this raises questions about the applicability of standard inferential methods. McKeague and Qian (2015) constructed an adaptive resampling test to circumvent this problem under ordinary linear regression. To accommodate right censoring, we develop a test statistic based on a maximally selected Koul--Susarla--Van Ryzin estimator from a marginal AFT model. A regularized bootstrap method is used to calibrate the test. Our test is more powerful and less conservative than the Bonferroni correction and other competing methods. The proposed method is evaluated in simulation studies and applied to two real data sets.
Biometry, Statistics, Survival analysis (Biometry)--Data processing, Failure time data analysis | th2455 | Biostatistics | Theses

Empirical Bayes, Bayes factors and deoxyribonucleic acid fingerprinting
https://academiccommons.columbia.edu/catalog/ac:2jm63xsj4b
Basu, Ruma | 10.7916/D8J67VGB | Wed, 04 Oct 2017 22:15:57 +0000
The central theme in this thesis is Empirical Bayes. It starts off with the application of Bayes and Empirical Bayes methods to deoxyribonucleic acid fingerprinting. Different Bayes factors are obtained and an alternative Bayes factor using the method of Savage is studied for both normal and non-normal priors. It then moves on to deeper methodological aspects of Empirical Bayes theory. A 1983 conjecture by Carl Morris on the parametric empirical Bayes prediction intervals for the normal regression model is studied and an improvement suggested. Carlin and Louis’ (1996) parametric empirical Bayes prediction interval for the same model is also dealt with analytically, while their approach had been primarily numerical. It is seen that both of these intervals have the same coverage probability and the same expected length up to a certain order of approximation, and both are equal-tailed up to the same order. Then the corrected proof of an important published result by Datta, Ghosh and Mukerjee (2000) is provided using first principles of probability matching. This result is relevant to our work on parametric empirical Bayes prediction intervals.
Statistics, DNA fingerprinting, Bioinformatics | Statistics | Theses

Developing Statistical Methods for Incorporating Complexity in Association Studies
https://academiccommons.columbia.edu/catalog/ac:2bvq83bk43
Palmer, Cameron Douglas | 10.7916/D8SQ9BX2 | Wed, 04 Oct 2017 22:15:44 +0000
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with hundreds of human traits. Yet the common variant model tested by traditional GWAS only provides an incomplete explanation for the known genetic heritability of many traits. Many divergent methods have been proposed to address the shortcomings of GWAS, including most notably the extension of association methods into rarer variants through whole exome and whole genome sequencing. GWAS methods feature numerous simplifications designed for feasibility and ease of use, as opposed to statistical rigor. Furthermore, no systematic quantification of the performance of GWAS across all traits exists. Beyond improving the utility of data that already exist, a more thorough understanding of the performance of GWAS on common variants may elucidate flaws not in the method but rather in its implementation, which may pose a continued or growing threat to the utility of rare variant association studies now underway.
This thesis focuses on systematic evaluation and incremental improvement of GWAS modeling. We collect a rich dataset containing standardized association results from all GWAS conducted on quantitative human traits, finding that while the majority of published significant results in the field do not disclose sufficient information to determine whether the results are actually valid, those that do replicate precisely in concordance with their statistical power when conducted in samples of similar ancestry and reporting accurate per-locus sample sizes. We then look to the inability of effectively all existing association methods to handle missingness in genetic data, and show that adapting missingness theory from statistics can both increase power and provide a flexible framework for extending most existing tools with minimal effort. We finally undertake novel variant association in a schizophrenia cohort from a bottleneck population. We find that the study itself is confounded by nonrandom population sampling and identity-by-descent, manifesting as batch effects correlated with outcome that remain in novel variants after all sample-wide quality control. On the whole, these results emphasize both the past and present utility and reliability of the GWAS model, as well as the extent to which lessons from the GWAS era must inform genetic studies moving forward.
Bioinformatics, Human genome--Research, Statistics, Genomics--Statistical methods, Research--Data processing | cdp2130 | Cellular, Molecular and Biomedical Studies | Theses

Statistical correction of the Winner’s Curse explains replication variability in quantitative trait genome-wide association studies
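The notion of replicating "in concordance with statistical power" can be made concrete with a textbook power calculation. This is a hedged sketch, not the authors' pipeline; the two-sided 5% Wald test and the function names are our own assumptions:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def replication_power(beta, se_repl):
    """Power of a two-sided 5% Wald test in a replication cohort,
    given true effect size beta and the replication standard error."""
    z_crit = 1.959963984540054  # 0.975 standard normal quantile
    z = abs(beta) / se_repl
    return (1.0 - normal_cdf(z_crit - z)) + normal_cdf(-z_crit - z)

def expected_replications(effects_and_ses):
    """Predicted number of replicated loci: the sum of per-locus powers,
    to be compared against the observed replication count."""
    return sum(replication_power(b, se) for b, se in effects_and_ses)
```

Summing per-locus powers and comparing against the observed count is the kind of calculation behind statements such as "predicted 458, observed 457" in the companion article below.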
https://academiccommons.columbia.edu/catalog/ac:5mkkwh70sx
Palmer, Cameron Douglas; Pe'er, Itshack G. | 10.7916/D8XD1D6K | Sat, 30 Sep 2017 17:42:59 +0000
Genome-wide association studies (GWAS) have identified hundreds of SNPs responsible for variation in human quantitative traits. However, genome-wide-significant associations often fail to replicate across independent cohorts, in apparent inconsistency with their strong effects in discovery cohorts. This limited success of replication raises pervasive questions about the utility of the GWAS field. We identify all 332 studies of quantitative traits from the NHGRI-EBI GWAS Database with attempted replication. We find that the majority of studies provide insufficient data to evaluate replication rates. The remaining papers replicate significantly worse than expected (p < 10^-14), even when adjusting for regression-to-the-mean of effect size between discovery and replication cohorts, termed the Winner’s Curse (p < 10^-16). We show this is due in part to misreporting replication cohort size as a maximum number, rather than a per-locus one. In 39 studies accurately reporting per-locus cohort size for attempted replication of 707 loci in samples with similar ancestry, the replication rate matched expectation (predicted 458, observed 457, p = 0.94). In contrast, ancestry differences between replication and discovery (13 studies, 385 loci) cause the most highly-powered decile of loci to replicate worse than expected, due to differences in linkage disequilibrium.
DNA replication, Genomes--Data processing, Human genetics--Variation, Genetics, Statistics | cdp2130, ip2169 | Computer Science, Biological Sciences | Articles

Property Testing and Probability Distributions: New Techniques, New Models, and New Goals
https://academiccommons.columbia.edu/catalog/ac:vmcvdncjwb
Canonne, Clement Louis | 10.7916/D8NK3SK2 | Fri, 29 Sep 2017 22:19:58 +0000
In order to study the real world, scientists (and computer scientists) develop simplified models that attempt to capture the essential features of the observed system. Understanding the power and limitations of these models, when they apply or fail to fully capture the situation at hand, is therefore of the utmost importance.
In this thesis, we investigate the role of some of these models in property testing of probability distributions (distribution testing), as well as in related areas. We introduce natural extensions of the standard model (which only allows access to independent draws from the underlying distribution), in order to circumvent some of its limitations or draw new insights about the problems they aim at capturing. Our results are organized in three main directions:
(i) We provide systematic approaches to tackle distribution testing questions. Specifically, we provide two general algorithmic frameworks that apply to a wide range of properties, and yield efficient and near-optimal results for many of them. We complement these by introducing two methodologies to prove information-theoretic lower bounds in distribution testing, which enable us to derive hardness results in a clean and unified way.
(ii) We introduce and investigate two new models of access to the unknown distributions, which both generalize the standard sampling model in different ways and allow testing algorithms to achieve significantly better efficiency. Our study of the power and limitations of algorithms in these models shows how these could lead to faster algorithms in practical situations, and yields a better understanding of the underlying bottlenecks in the standard sampling setting.
(iii) We then leave the field of distribution testing to explore areas adjacent to property testing. We define a new algorithmic primitive of sampling correction, which in some sense lies in between distribution learning and testing and aims to capture settings where data originates from imperfect or noisy sources. Our work sets out to model these situations in a rigorous and abstracted way, in order to enable the development of systematic methods to address these issues.
Computer science, Statistics, Distribution (Probability theory), Algorithms | clc2200 | Computer Science | Theses

Distributionally Robust Performance Analysis with Applications to Mine Valuation and Risk
https://academiccommons.columbia.edu/catalog/ac:g4f4qrfj84
Dolan, Christopher James | 10.7916/D8QJ7VSC | Fri, 29 Sep 2017 22:18:31 +0000
We consider several problems motivated by issues faced in the mining industry. In recent years, it has become clear that mines have substantial tail risk in the form of environmental disasters, and this tail risk is not incorporated into common pricing and risk models. However, data sets of the extremal climate behavior that drives this risk are very small, and generally inadequate for properly estimating the tail behavior. We propose a data-driven methodology that comes up with reasonable worst-case scenarios, given the data size constraints, and we incorporate this into a real options based model for the valuation of mines. We propose several different iterations of the model, to allow the end-user to choose the degree to which they wish to specify the financial consequences of the disaster scenario. Next, in order to perform a risk analysis on a portfolio of mines, we propose a method of estimating the correlation structure of high-dimensional max-stable processes. Using the techniques of Liu et al. (2017) to map the relationship between normal correlations and max-stable correlations, we can then use techniques inspired by Bickel et al. (2008), Liu et al. (2014), and Rothman et al. (2009) to estimate the underlying correlation matrix, while preserving a sparse, positive-definite structure. The correlation matrices are then used in the calculation of model-robust risk metrics (VaR, CVaR) using the Sample-Out-of-Sample methodology of Blanchet and Kang (2017). We conclude with several new techniques that were developed in the field of robust performance analysis that, while not directly applied to mining, were motivated by our studies into distributionally robust optimization in order to address these problems.
Statistics, Mine valuation--Statistical methods, Robust statistics | cjd2119 | Statistics | Theses

Distributionally Robust Optimization and its Applications in Machine Learning
https://academiccommons.columbia.edu/catalog/ac:9cnp5hqc0q
Kang, Yang | 10.7916/D8WD4C1R | Fri, 25 Aug 2017 22:13:06 +0000
The goal of Distributionally Robust Optimization (DRO) is to minimize the cost of running a stochastic system, under the assumption that an adversary can replace the underlying baseline stochastic model by another model within a family known as the distributional uncertainty region. This dissertation focuses on a class of DRO problems which are data-driven, which generally speaking means that the baseline stochastic model corresponds to the empirical distribution of a given sample.
One of the main contributions of this dissertation is to show that the class of data-driven DRO problems that we study unify many successful machine learning algorithms, including square root Lasso, support vector machines, and generalized logistic regression, among others. A key distinctive feature of the class of DRO problems that we consider here is that our distributional uncertainty region is based on optimal transport costs. In contrast, most of the DRO formulations that exist to date take advantage of a likelihood based formulation (such as Kullback-Leibler divergence, among others). Optimal transport costs include as a special case the so-called Wasserstein distance, which is popular in various statistical applications.
The use of optimal transport costs is advantageous relative to the use of divergence-based formulations because the region of distributional uncertainty contains distributions which explore samples outside of the support of the empirical measure, therefore explaining why many machine learning algorithms have the ability to improve generalization. Moreover, the DRO representations that we use to unify the previously mentioned machine learning algorithms provide a clear interpretation of the so-called regularization parameter, which is known to play a crucial role in controlling generalization error. As we establish, the regularization parameter corresponds exactly to the size of the distributional uncertainty region.
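As a concrete instance of the unification described above, the square-root Lasso objective can be written down and minimized directly. This is an illustrative numerical sketch assuming scipy is available, not the dissertation's code:

```python
import numpy as np
from scipy.optimize import minimize

def sqrt_lasso(X, y, lam):
    """Square-root Lasso: minimize ||y - X b||_2 / sqrt(n) + lam * ||b||_1.
    Under the DRO reading sketched here, lam is not an ad hoc tuning knob
    but the size of the optimal-transport distributional uncertainty region."""
    n, p = X.shape

    def objective(b):
        return np.linalg.norm(y - X @ b) / np.sqrt(n) + lam * np.abs(b).sum()

    # Generic local search; dedicated convex solvers would be used in practice.
    result = minimize(objective, np.zeros(p), method="Powell")
    return result.x
```

A dedicated convex solver (or coordinate descent) is the practical choice; the point of the sketch is only that the penalty weight appearing in the objective is exactly the quantity the DRO representation interprets.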
Another contribution of this dissertation is the development of statistical methodology to study data-driven DRO formulations based on optimal transport costs. Using this theory, for example, we provide a sharp characterization of the optimal selection of regularization parameters in machine learning settings such as square-root Lasso and regularized logistic regression.
Our statistical methodology relies on the construction of a key object which we call the robust Wasserstein profile function (RWP function). The RWP function is similar in spirit to the empirical likelihood profile function in the context of empirical likelihood (EL). But the asymptotic analysis of the RWP function is different because of a certain lack of smoothness which arises in a suitable Lagrangian formulation.
Optimal transport costs have many advantages in terms of statistical modeling. For example, we show how to define a class of novel semi-supervised learning estimators which are natural companions of the standard supervised counterparts (such as square root Lasso, support vector machines, and logistic regression). We also show how to define the distributional uncertainty region in a purely data-driven way. Precisely, the optimal transport formulation allows us to inform the shape of the distributional uncertainty, not only its center (which is given by the empirical distribution). This shape is informed by establishing connections to the metric learning literature. We develop a class of metric learning algorithms which are based on robust optimization. We use the robust-optimization-based metric learning algorithms to inform the distributional uncertainty region in our data-driven DRO problem. This means that we endow the adversary with additional power, which forces him to spend effort on regions of importance to further improve generalization properties of machine learning algorithms.
In summary, we explain how the use of optimal transport costs allows constructing what we call double-robust statistical procedures. We test all of the procedures proposed in this dissertation on various data sets, showing significant improvement in generalization ability over a wide range of state-of-the-art procedures.
Finally, we also discuss a class of stochastic optimization algorithms of independent interest which are particularly useful to solve DRO problems, especially those which arise when the distributional uncertainty region is based on optimal transport costs.
Statistics, Robust optimization, Machine learning, Mathematical optimization | yk2606 | Statistics | Theses

Accurate and Sensitive Quantification of Protein-DNA Binding Affinity
https://academiccommons.columbia.edu/catalog/ac:9p8cz8w9hj
Rastogi, Chaitanya | 10.7916/D86T104B | Tue, 22 Aug 2017 22:27:02 +0000
Transcription factors control gene expression by binding to genomic DNA in a sequence-specific manner. Mutations in transcription factor binding sites are increasingly found to be associated with human disease, yet we currently lack robust methods to predict these sites. Here we developed a versatile maximum likelihood framework, named No Read Left Behind (NRLB), that fits a biophysical model of protein-DNA recognition to all in vitro selected DNA binding sites across the full affinity range. NRLB predicts human Max homodimer binding in near-perfect agreement with existing low-throughput measurements. The model captures the specificity of p53 tetrameric binding sites and discovers multiple binding modes in a single sample. Additionally, we confirm that newly-identified low-affinity enhancer binding sites are functional in vivo, and that their contribution to gene expression matches their predicted affinity. Our results establish a powerful paradigm for identifying protein binding sites and interpreting gene regulatory sequences in eukaryotic genomes.
Genetics, Biophysics, Statistics, DNA-protein interactions, Transcription factors | cr2166 | Applied Physics and Applied Mathematics | Theses

A unified view of high-dimensional bridge regression
https://academiccommons.columbia.edu/catalog/ac:3xsj3tx96j
Weng, Haolei | 10.7916/D82V2THP | Tue, 15 Aug 2017 22:36:12 +0000
In many application areas ranging from bioinformatics to imaging, we are interested in recovering a sparse coefficient in the high-dimensional linear model, when the sample size n is comparable to or less than the dimension p. One of the most popular classes of estimators is the Lq-regularized least squares (LQLS), a.k.a. bridge regression. There have been extensive studies towards understanding the performance of the best subset selection (q=0), LASSO (q=1) and ridge (q=2), three widely known estimators from the LQLS family. This thesis aims at giving a unified view of LQLS for all the non-negative values of q. In contrast to most existing works, which obtain order-wise error bounds with loose constants, we derive asymptotically exact error formulas characterized through a series of fixed point equations. A delicate analysis of the fixed point equations enables us to gain fruitful insights into the statistical properties of LQLS across the entire spectrum of Lq-regularization. Our work not only validates the scope of folklore understanding of Lq-minimization, but also provides new insights into high-dimensional statistics as a whole. We will elaborate on our theoretical findings mainly from a parameter estimation point of view. At the end of the thesis, we briefly mention bridge regression for variable selection and prediction.
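For reference, the LQLS (bridge) estimator studied throughout can be sketched numerically for small p. This is an illustration using generic local optimization, not the asymptotic machinery of the thesis; for q < 1 the problem is non-convex and such a sketch finds only a local minimum:

```python
import numpy as np
from scipy.optimize import minimize

def bridge_regression(X, y, lam, q):
    """L_q-regularized least squares (LQLS / bridge regression):
    minimize 0.5 * ||y - X b||_2^2 + lam * sum_j |b_j|^q.
    q=1 gives the LASSO penalty and q=2 gives ridge; q=0 (best subset)
    and q < 1 are non-convex and need specialized algorithms."""
    n, p = X.shape

    def objective(b):
        return 0.5 * np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b) ** q)

    return minimize(objective, np.zeros(p), method="Powell").x
```

For q=2 the result can be checked against the closed-form ridge solution (X'X + 2*lam*I)^{-1} X'y, which also makes this sketch easy to sanity-test.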
We start by considering the parameter estimation problem and evaluate the performance of LQLS by characterizing the asymptotic mean square error (AMSE). The expression we derive for AMSE does not have explicit forms and hence is not useful in comparing LQLS for different values of q, or providing information in evaluating the effect of relative sample size n/p or the sparsity level of the coefficient. To simplify the expression, we first perform the phase transition (PT) analysis, a widely accepted analysis diagram, of LQLS. Our results reveal some of the limitations and misleading features of the PT framework. To overcome these limitations, we propose the small-error analysis of LQLS. Our new analysis framework not only sheds light on the results of the phase transition analysis, but also describes when phase transition analysis is reliable, and presents a more accurate comparison among different Lq-regularizations.
We then extend our low noise sensitivity analysis to linear models without sparsity structure. Our analysis, as a generalization of phase transition analysis, reveals a clear picture of bridge regression for estimating generic coefficients. Moreover, by a simple transformation we connect our low-noise sensitivity framework to the classical asymptotic regime in which n/p goes to infinity, and give some insightful implications beyond what classical asymptotic analysis of bridge regression can offer.
Furthermore, following the same idea behind the new analysis framework, we obtain an explicit characterization of AMSE in the form of second-order expansions under the large-noise regime. The expansions carry some intriguing messages. For example, ridge will outperform LASSO in estimating sparse coefficients when the measurement noise is large.
Finally, we present a short analysis of LQLS for the purposes of variable selection and prediction. We propose a two-stage variable selection technique based on the LQLS estimators, and describe its superiority and close connection to parameter estimation. For prediction, we illustrate the intricate relation between tuning parameter selection for optimal in-sample prediction and for optimal parameter estimation.Statistics, Regression analysis, Mathematicshw2375StatisticsThesesContributions to Semiparametric Inference to Biased-Sampled and Financial Data
https://academiccommons.columbia.edu/catalog/ac:177018
Sit, Tony10.7916/D81R72W2Wed, 09 Aug 2017 15:54:08 +0000This thesis develops statistical models and methods for the analysis of life-time and financial data under the umbrella of a semiparametric framework. The first part studies the use of empirical likelihood on Lévy processes that are used to model the dynamics exhibited in financial data. The second part is a study of inferential procedures for survival data collected under various biased sampling schemes, in transformation and accelerated failure time models. During the last decade, Lévy processes with jumps have gained increasing popularity for modelling market behaviour for both derivative pricing and risk management purposes. Chan et al. (2009) introduced the use of empirical likelihood methods to estimate the parameters of various diffusion processes via their characteristic functions, which are readily available in most cases. Return series from the market are used for estimation. In addition to the return series, there are many derivatives actively traded in the market whose prices also contain information about parameters of the underlying process. This observation motivates us to combine the return series and the associated derivative prices observed in the market so as to provide an estimation that more closely reflects market movement and achieves a gain in efficiency. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. We performed simulation and case studies to demonstrate the feasibility and effectiveness of the proposed method. The second part of this thesis investigates a unified estimation method for semiparametric linear transformation models and accelerated failure time models under general biased sampling schemes. The methodology proposed is first investigated in Paik (2009), in which the length-biased case is considered for transformation models. 
The new estimator is obtained from a set of counting process-based unbiased estimating equations, developed through introducing a general weighting scheme that offsets the sampling bias. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. A closed-form formula is derived for the limiting variance and the plug-in estimator is shown to be consistent. We demonstrate the unified approach through the special cases of left truncation, length-bias, the case-cohort design and variants thereof. Simulation studies and applications to real data sets are also presented.Statisticsts2500StatisticsThesesDetecting Dependence Change Points in Multivariate Time Series with Applications in Neuroscience and Finance
https://academiccommons.columbia.edu/catalog/ac:177012
Cribben, Ivor John10.7916/D8JQ1CF0Wed, 09 Aug 2017 15:53:53 +0000In many applications there are dynamic changes in the dependency structure between multivariate time series. Two examples include neuroscience and finance. The second and third chapters focus on neuroscience and introduce a data-driven technique for partitioning a time course into distinct temporal intervals with different multivariate functional connectivity patterns between a set of brain regions of interest (ROIs). The technique, called Dynamic Connectivity Regression (DCR), detects temporal change points in functional connectivity and estimates a graph, or set of relationships between ROIs, for data in the temporal partition that falls between pairs of change points. Hence, DCR allows for estimation of both the time of change in connectivity and the connectivity graph for each partition, without requiring prior knowledge of the nature of the experimental design. Permutation and bootstrapping methods are used to perform inference on the change points. In the second chapter of this work, we focus on multi-subject data while in the third chapter, we concentrate on single-subject data and extend the DCR methodology in two ways: (i) we alter the algorithm to make it more accurate for individual subject data with a small number of observations and (ii) we perform inference on the edges or connections between brain regions in order to reduce the number of false positives in the graphs. We also discuss a Likelihood Ratio test to compare precision matrices (inverse covariance matrices) across subjects as well as a test across subjects on the single edges or partial correlations in the graph. In the final chapter of this work, we turn to a finance setting. We use the same DCR technique to detect changes in dependency structure in multivariate financial time series for situations where both the placement and number of change points is unknown. 
In this setting, DCR finds the dependence change points and estimates an undirected graph representing the relationship between time series within each interval created by pairs of adjacent change points. A shortcoming of the proposed DCR methodology is the presence of an excessive number of false positive edges in the undirected graphs, especially when the data deviates from normality. Here we address this shortcoming by proposing a procedure for performing inference on the edges, or partial dependencies between time series, that effectively removes false positive edges. We also discuss two robust estimation procedures based on ranks and the tlasso (Finegold and Drton, 2011) technique, which we contrast with the glasso technique used by DCR.Statisticsijc2104StatisticsThesesStructured Tensor Recovery and Decomposition
https://academiccommons.columbia.edu/catalog/ac:x3ffbg79fs
Mu, Cun10.7916/D8DV1X6MMon, 17 Jul 2017 16:14:19 +0000Tensors, a.k.a. multi-dimensional arrays, arise naturally when modeling higher-order objects and relations. Across ubiquitous applications, including image processing, collaborative filtering, demand forecasting and higher-order statistics, there are two recurring themes: tensor recovery and tensor decomposition. The first aims to recover the underlying tensor from incomplete information; the second studies a variety of tensor decompositions to represent the array more concisely and to capture the salient characteristics of the underlying data. Both topics are addressed in this thesis.
Chapter 2 and Chapter 3 focus on low-rank tensor recovery (LRTR) from both theoretical and algorithmic perspectives. In Chapter 2, we first provide a negative result for the sum of nuclear norms (SNN) model---an existing convex model widely used for LRTR; then we propose a novel convex model and prove this new model is better than the SNN model in terms of the number of measurements required to recover the underlying low-rank tensor. In Chapter 3, we first build up the connection between robust low-rank tensor recovery and the compressive principal component pursuit (CPCP), a convex model for robust low-rank matrix recovery. Then we focus on developing convergent and scalable optimization methods to solve the CPCP problem. Specifically, our convergent method, obtained by combining classical ideas from Frank-Wolfe and proximal methods, achieves scalability with linear per-iteration cost.
Chapter 4 generalizes the successive rank-one approximation (SROA) scheme for matrix eigen-decomposition to a special class of tensors called symmetric and orthogonally decomposable (SOD) tensors. We prove that the SROA scheme can robustly recover the symmetric canonical decomposition of the underlying SOD tensor even in the presence of noise. Perturbation bounds, which can be regarded as a higher-order generalization of the Davis-Kahan theorem, are provided in terms of the noise magnitude.Operations research, Computer science, Statistics, Calculus of tensorscm3052Industrial Engineering and Operations ResearchThesesAdvantages of Synthetic Noise and Machine Learning for Analyzing Radioecological Data Sets
https://academiccommons.columbia.edu/catalog/ac:206927
Shuryak, Igor10.7916/D80V8JF4Wed, 05 Jul 2017 13:44:23 +0000The ecological effects of accidental or malicious radioactive contamination are insufficiently understood because of the hazards and difficulties associated with conducting studies in radioactively-polluted areas. Data sets from severely contaminated locations can therefore be small. Moreover, many potentially important factors, such as soil concentrations of toxic chemicals, pH, and temperature, can be correlated with radiation levels and with each other. In such situations, commonly-used statistical techniques like generalized linear models (GLMs) may not be able to provide useful information about how radiation and/or these other variables affect the outcome (e.g. abundance of the studied organisms). Ensemble machine learning methods such as random forests offer powerful alternatives. We propose that analysis of small radioecological data sets by GLMs and/or machine learning can be made more informative by using the following techniques: (1) adding synthetic noise variables to provide benchmarks for distinguishing the performances of valuable predictors from irrelevant ones; (2) adding noise directly to the predictors and/or to the outcome to test the robustness of analysis results against random data fluctuations; (3) adding artificial effects to selected predictors to test the sensitivity of the analysis methods in detecting predictor effects; (4) running a selected machine learning method multiple times (with different random-number seeds) to test the robustness of the detected “signal”; (5) using several machine learning methods to test the “signal’s” sensitivity to differences in analysis techniques. 
Here, we applied these approaches to simulated data, and to two published examples of small radioecological data sets: (I) counts of fungal taxa in samples of soil contaminated by the Chernobyl nuclear power plant accident (Ukraine), and (II) bacterial abundance in soil samples under a ruptured nuclear waste storage tank (USA). We show that the proposed techniques were advantageous compared with the methodology used in the original publications where the data sets were presented. Specifically, our approach identified a negative effect of radioactive contamination in data set I, and suggested that in data set II stable chromium could have been a stronger limiting factor for bacterial abundance than the radionuclides 137Cs and 99Tc. This new information, which was extracted from these data sets using the proposed techniques, can potentially enhance the design of radioactive waste bioremediation.Machine learning, Statistics, Radioactive pollution, Medical radiologyis144RadiologyArticlesTime Series Modeling with Shape Constraints
https://academiccommons.columbia.edu/catalog/ac:qz612jm65v
Zhang, Jing10.7916/D84X5M55Fri, 30 Jun 2017 22:15:28 +0000This thesis focuses on the development of semiparametric estimation methods for a class of time series models using shape constraints. Many of the existing time series models assume the noise follows some known parametric distributions. Typical examples are the Gaussian and t distributions. Then the model parameters are estimated by maximizing the resultant likelihood function.
As an example, the autoregressive moving average (ARMA) models (Brockwell and Davis, 2009) assume a Gaussian noise sequence and are estimated under the causal-invertible constraint by maximizing the Gaussian likelihood. Although the same estimates can also be used in the causal-invertible non-Gaussian case, they are not asymptotically optimal (Rosenblatt, 2012). Moreover, for the noncausal/noninvertible cases, the Gaussian likelihood estimation procedure is not applicable, since second-order based methods cannot distinguish between causal-invertible and noncausal/noninvertible models (Brockwell and Davis, 2009). As a result, many estimation methods for noncausal/noninvertible ARMA models assume the noise follows a known non-Gaussian distribution, like a Laplace distribution or a t distribution. To relax this distributional assumption and allow noncausal/noninvertible models, we borrow ideas from nonparametric shape-constrained density estimation and propose a semiparametric estimation procedure for general ARMA models by projecting the underlying noise distribution onto the space of log-concave measures (Cule and Samworth, 2010; Dümbgen et al., 2011). We show the maximum likelihood estimators in this semiparametric setting are consistent. In fact, the MLE is robust to the misspecification of log-concavity in cases where the true distribution of the noise is close to its log-concave projection. We derive a lower bound for the best asymptotic variance of regular estimators at rate sqrt(n) for AR models and construct a semiparametric efficient estimator.
We also consider modeling time series of counts with shape constraints. Many of the formulated models for count time series are expressed via a pair of generalized state-space equations. In this set-up, the observation equation specifies the conditional distribution of the observation Yt at time t given a state variable Xt. For count time series, this conditional distribution is usually specified as coming from a known parametric family such as the Poisson or the Negative Binomial distribution. To relax this formal parametric framework, we introduce a concave shape constraint into the one-parameter exponential family. This essentially amounts to assuming that the reference measure is log-concave. In this fashion, we are able to extend the class of observation-driven models studied in Davis and Liu (2016). Under this formulation, there exists a stationary and ergodic solution to the state-space model. In this new modeling framework, we consider the inference problem of estimating both the parameters of the mean model and the log-concave function, corresponding to the reference measure. We then compute and maximize the likelihood function over both the parameters associated with the mean function and the reference measure subject to a concavity constraint. The estimator of the mean function and the conditional distribution are shown to be consistent and perform well compared to a full parametric model specification. The finite-sample behavior of the estimators is studied via simulation, and two empirical examples are provided to illustrate the methodology.Statistics, Time-series analysis--Mathematical modelsjz2300StatisticsThesesBias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation
https://academiccommons.columbia.edu/catalog/ac:201409
Palmer, Cameron Douglas; Pe’er, Itsik10.7916/D8JS9QKNFri, 30 Jun 2017 18:33:07 +0000Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data.Multiple imputation (Statistics), Missing observations (Statistics), Genetics--Statistical methods, Genetics, Statisticscdp2130Computer ScienceArticlesFinding Alternatives to the Dogma of Power Based Sample Size Calculation: Is a Fixed Sample Size Prospective Meta-Experiment a Potential Alternative?
https://academiccommons.columbia.edu/catalog/ac:201635
Tavernier, Elsa; Trinquart, Ludovic; Giraudeau, Bruno10.7916/D89P31T1Fri, 30 Jun 2017 18:31:22 +0000Sample sizes for randomized controlled trials are typically based on power calculations. They require us to specify values for parameters such as the treatment effect, which is often difficult because we lack sufficient prior information. The objective of this paper is to provide an alternative design which circumvents the need for sample size calculation. In a simulation study, we compared a meta-experiment approach to the classical approach to assess treatment efficacy. The meta-experiment approach involves use of meta-analyzed results from 3 randomized trials of fixed sample size, 100 subjects. The classical approach involves a single randomized trial with the sample size calculated on the basis of an a priori-formulated hypothesis. For the sample size calculation in the classical approach, we used observed articles to characterize errors made on the formulated hypothesis. A prospective meta-analysis of data from trials of fixed sample size provided the same precision, power and type I error rate, on average, as the classical approach. The meta-experiment approach may provide an alternative design which does not require a sample size calculation and addresses the essential need for study replication; results may have greater external validity.Clinical trials, Meta-analysis, Epidemiology--Methodology, Public health, Epidemiology, StatisticsEpidemiologyArticlesReply
https://academiccommons.columbia.edu/catalog/ac:196838
Mason, Simon J.; Tippett, Michael K.; Weigel, Andreas P.; Goddard, Lisa M.; Rajaratnam, Balakanapathy10.7916/D8Z31ZKBFri, 30 Jun 2017 16:52:38 +0000Reply to a comment on the article: Conditional Exceedance Probabilities. Monthly Weather Review 135 (2010), 363–372 (available in Academic Commons at http://dx.doi.org/10.7916/D8PK0G2S).Climatic changes--Mathematical models, Statistics, Climatic changes--Forecasting, Atmospheresjm2103, mkt14, lmg107International Research Institute for Climate and Society, Applied Physics and Applied MathematicsArticlesComparative Validity of 3 Diabetes Mellitus Risk Prediction Scoring Models in a Multiethnic US Cohort: The Multi-Ethnic Study of Atherosclerosis
https://academiccommons.columbia.edu/catalog/ac:200753
Mann, Devin M.; Bertoni, Alain G.; Shimbo, Daichi; Carnethon, Mercedes R.; Chen, Haiying; Jenny, Nancy Swords; Muntner, Paul10.7916/D8SN093SFri, 30 Jun 2017 16:52:05 +0000Several models for estimating risk of incident diabetes in US adults are available. The authors aimed to determine the discriminative ability and calibration of published diabetes risk prediction models in a contemporary multiethnic cohort. Participants in the Multi-Ethnic Study of Atherosclerosis without diabetes at baseline (2000–2002; n = 5,329) were followed for a median of 4.75 years. The predicted risk of diabetes was calculated using published models from the Framingham Offspring Study, the Atherosclerosis Risk in Communities (ARIC) Study, and the San Antonio Heart Study. The mean age of participants was 61.6 years (standard deviation, 10.2); 29.3% were obese, 53.1% had hypertension, 34.9% had a family history of diabetes, 27.5% had high triglyceride levels, 33.8% had low high density lipoprotein cholesterol levels, and 15.3% had impaired fasting glucose. There were 446 incident cases of diabetes (fasting glucose level ≥126 mg/dL or initiation of antidiabetes medication use) diagnosed during follow-up. C statistics were 0.78, 0.84, and 0.83 for the Framingham, ARIC, and San Antonio risk prediction models, respectively. There were significant differences between observed and predicted diabetes risks (Hosmer-Lemeshow goodness-of-fit chi-squared test for each model: P < 0.001). The recalibrated and best-fit models achieved sufficient goodness of fit (each P > 0.10). The Framingham, ARIC, and San Antonio models maintained high discriminative ability but required recalibration in a modern, multiethnic US cohort.Diabetes--Epidemiology, Cohort analysis, Diabetes--Risk factors, Epidemiology, Statisticsds2231Center for Behavioral Cardiovascular HealthArticlesConditional Exceedance Probabilities
https://academiccommons.columbia.edu/catalog/ac:196847
Mason, Simon J.; Galpin, Jacqueline S.; Goddard, Lisa M.; Graham, Nicholas E.; Rajaratnam, Balakanapathy10.7916/D8PK0G2SFri, 30 Jun 2017 16:50:29 +0000Probabilistic forecasts of variables measured on a categorical or ordinal scale, such as precipitation occurrence or temperatures exceeding a threshold, are typically verified by comparing the relative frequency with which the target event occurs given different levels of forecast confidence. The degree to which this conditional (on the forecast probability) relative frequency of an event corresponds with the actual forecast probabilities is known as reliability, or calibration. Forecast reliability for binary variables can be measured using the Murphy decomposition of the (half) Brier score, and can be presented graphically using reliability and attributes diagrams. For forecasts of variables on continuous scales, however, an alternative measure of reliability is required. The binned probability histogram and the reliability component of the continuous ranked probability score have been proposed as appropriate verification procedures in this context, but are subject to some limitations. A procedure is proposed that is applicable in the context of forecast ensembles and is an extension of the binned probability histogram. Individual ensemble members are treated as estimates of quantiles of the forecast distribution, and the conditional probability that the observed precipitation, for example, exceeds the amount forecast [the conditional exceedance probability (CEP)] is calculated. Generalized linear regression is used to estimate these conditional probabilities. 
A diagram showing the CEPs for ranked ensemble members is suggested as a useful method for indicating reliability when forecasts are on a continuous scale, and various statistical tests are suggested for quantifying the reliability.Climatic changes--Mathematical models, Statistics, Climatic changes--Forecasting, Atmospheresjm2103, lmg107International Research Institute for Climate and SocietyArticlesAssessing the predictability of extreme rainfall seasons over southern Africa
https://academiccommons.columbia.edu/catalog/ac:196917
Landman, Willem A.; Botes, Stephanie; Goddard, Lisa M.; Shongwe, Mxolisi10.7916/D8B56JPQFri, 30 Jun 2017 16:49:40 +0000A model output statistics (MOS) technique is developed to investigate the potential rainfall forecast skill for extreme seasons over southern Africa. Rainfall patterns produced by the ECHAM4.5 atmospheric GCM are statistically recalibrated to regional rainfall for the seasons of September–November, December–February, March–May and June–August. Archived records of the GCM simulated fields are related to observed rainfall through a set of canonical correlation analysis (CCA) equations. Probabilistic forecast skill (RPSS and ROC) of MOS-recalibrated simulations for 5 equi-probable categories is assessed using a 3-year-out cross-validation approach. High skill RPSS values are found for the DJF and MAM seasons. Although ROC scores for DJF and MAM are larger than 0.5 for all categories (scores less than 0.5 suggest negative skill), scores for DJF show that the extreme categories are more predictable than the inner categories and scores for MAM show that skill is mostly associated with the extremely wet category. The GCM's ability to reproduce tropical-temperate trough variability constitutes the main source of predictability for DJF and MAM.Atmospheric circulation, Statistics, Precipitation forecasting, Atmospherewal2113, lmg107International Research Institute for Climate and SocietyArticlesNine Justices, Ten Years: A Statistical Retrospective
https://academiccommons.columbia.edu/catalog/ac:199318
Jackson, Robert J.; Vignarajah, Thiruvendran10.7916/D800024NFri, 30 Jun 2017 16:49:40 +0000The 2003 Term marked an unprecedented milestone for the Supreme Court: for the first time in history, nine Justices celebrated a full decade presiding together over the nation's highest court. The continuity of the current Court is especially striking given that, on average, one new Justice has been appointed approximately every two years since the Court's expansion to nine members in 1837. Although the Harvard Law Review has prepared statistical retrospectives in the past, the last decade presents a rare opportunity to study the Court free from the disruptions of intervening appointments. Presented here is a review of the 823 cases decided by the Court over the past decade. Of course, bare statistics cannot capture the nuanced interactions among the Justices, nor substantiate any particular theory about the complex dynamics of the Court. Rather, this statistical compilation and the preliminary observations articulated here are intended only as a starting point: a modest effort to showcase trends that deserve closer attention and to jumpstart more robust analyses of how the Court, despite its apparent stability, has evolved over the past decade.Law reports, digests, etc., Law, Statistics, United States. Supreme Courtrj2317LawArticlesPredicting southern African summer rainfall using a combination of MOS and perfect prognosis
https://academiccommons.columbia.edu/catalog/ac:196920
Landman, Willem A.; Goddard, Lisa M.10.7916/D8959HHWFri, 30 Jun 2017 16:49:40 +0000A statistical-dynamical approach to probabilistic precipitation forecasts of southern African summer rainfall is described and validated. An ensemble of seasonal precipitation and circulation fields is obtained from the ECHAM4.5 atmospheric general circulation model (AGCM). Model output statistics (MOS) then spatially recalibrate the AGCM fields relative to observations. Although the MOS equations are built using the simulation data, in which observed SSTs force the AGCM, the same set of equations can be applied to the predicted data, in which predicted SSTs force the AGCM. The use of prediction data in a set of equations developed for simulations assumes that the AGCM forecast skill approximates its simulation skill and that the systematic biases of the AGCM do not change in a prediction setting; this assumption is analogous to a perfect prognosis (PP) approach. Probabilistic forecast skill is assessed using this MOS-PP-recalibration scheme for 3 equi-probable categories using a 3-year-out cross-validation approach. High skill scores are found over the north-eastern interior of the region, with marginal skill over the remainder of the austral summer rainfall regions. When skill is assessed for only the wettest and driest of the years, high skill appears over most of the region.Atmospheric circulation, Statistics, Precipitation forecasting, Atmospherewal2113, lmg107International Research Institute for Climate and SocietyArticlesStatistical–Dynamical Seasonal Forecasts of Central-Southwest Asian Winter Precipitation
https://academiccommons.columbia.edu/catalog/ac:196890
Tippett, Michael K.; Goddard, Lisa M.; Barnston, Anthony G.10.7916/D8NK3DZFFri, 30 Jun 2017 16:49:40 +0000Interannual precipitation variability in central-southwest (CSW) Asia has been associated with East Asian jet stream variability and western Pacific tropical convection. However, atmospheric general circulation models (AGCMs) forced by observed sea surface temperature (SST) poorly simulate the region’s interannual precipitation variability. The statistical–dynamical approach uses statistical methods to correct systematic deficiencies in the response of AGCMs to SST forcing. Statistical correction methods linking model-simulated Indo–west Pacific precipitation and observed CSW Asia precipitation result in modest, but statistically significant, cross-validated simulation skill in the northeast part of the domain for the period from 1951 to 1998. The statistical–dynamical method is also applied to recent (winter 1998/99 to 2002/03) multimodel, two-tier December–March precipitation forecasts initiated in October. This period includes 4 yr (winter of 1998/99 to 2001/02) of severe drought. Tercile probability forecasts are produced using ensemble-mean forecasts and forecast error estimates. The statistical–dynamical forecasts show enhanced probability of below-normal precipitation for the four drought years and capture the return to normal conditions in part of the region during the winter of 2002/03.Precipitation forecasting, Statistics, Climatic changes--Forecasting, Atmospheremkt14, lmg107, agb52International Research Institute for Climate and Society, Applied Physics and Applied MathematicsArticlesPrior Design for Dependent Dirichlet Processes: An Application to Marathon Modeling
https://academiccommons.columbia.edu/catalog/ac:195557
Pradier, Melanie F.; Ruiz, Francisco Jesus Rodriguez; Perez-Cruz, Fernando10.7916/D8SN08V7Fri, 30 Jun 2017 00:47:30 +0000This paper presents a novel application of Bayesian nonparametrics (BNP) for marathon data modeling. We make use of two well-known BNP priors, the single-p dependent Dirichlet process and the hierarchical Dirichlet process, in order to address two different problems. First, we study the impact of age, gender and environment on the runners’ performance. We derive a fair grading method that allows direct comparison of runners regardless of their age and gender. Unlike current grading systems, our approach is based not only on top world records, but on the performances of all runners. The presented methodology for comparison of densities can be adopted in many other applications straightforwardly, providing an interesting perspective to build dependent Dirichlet processes. Second, we analyze the running patterns of the marathoners in time, obtaining information that can be valuable for training purposes. We also show that these running patterns can be used to predict finishing time given intermediate interval measurements. We apply our models to New York City, Boston and London marathons.Marathon running, Running races--Data processing, Nonparametric statistics, Stochastic processes, Statistics, Information sciencefr2392Data Science InstituteArticlesAre We Ready for Mass Fatality Incidents? Preparedness of the US Mass Fatality Infrastructure
https://academiccommons.columbia.edu/catalog/ac:192811
Merrill, Jacqueline A.; Orr, Mark; Chen, Daniel; Zhi, Qi; Gershon, Robyn R.10.7916/D8125SF8Fri, 30 Jun 2017 00:45:26 +0000Objective To assess the preparedness of the US mass fatality infrastructure, we developed and tested metrics for 3 components of preparedness: organizational, operational, and resource sharing networks.
Methods: In 2014, data were collected from 5 response sectors: medical examiners and coroners, the death care industry, health departments, faith-based organizations, and offices of emergency management. Scores were calculated within and across sectors, and a weighted score was developed for the infrastructure.
Results: A total of 879 respondents reported highly variable organizational capabilities: 15% had responded to a mass fatality incident (MFI); 42% reported staff trained for an MFI, but only 27% for an MFI involving hazardous contaminants. Respondents estimated that 75% of their staff would be willing and able to respond, but only 53% if contaminants were involved. Most perceived their organization as somewhat prepared, but 13% indicated “not at all.” Operational capability scores ranged from 33% (death care industry) to 77% (offices of emergency management). Network capability analysis found that only 42% of possible reciprocal relationships between resource-sharing partners were present. The cross-sector composite score was 51%; that is, half the key capabilities for preparedness were in place.
Conclusions: The sectors in the US mass fatality infrastructure report suboptimal capability to respond. National leadership is needed to ensure sector-specific and infrastructure-wide preparedness for a large-scale MFI.Mass casualties, Medical care, Emergency management, Disaster medicine, Medical sciences, Health services administration, Statisticsjam119, rg405NursingArticlesMicronutrients in HIV: A Bayesian Meta-Analysis
https://academiccommons.columbia.edu/catalog/ac:210017
Carter, George M.; Indyk, Debbie; Johnson, Matthew S.; Andreae, Michael; Suslov, Kathryn; Busani, Sudharani; Esmaeili, Aryan; Sacks, Henry S.10.7916/D8WM1D7SFri, 30 Jun 2017 00:43:41 +0000Background: Approximately 28.5 million people living with HIV are eligible for treatment (CD4 < 500), but currently have no access to antiretroviral therapy. Reduced serum levels of micronutrients are common in HIV disease. Micronutrient supplementation (MNS) may mitigate disease progression and mortality. Objectives: We synthesized evidence on the effect of micronutrient supplementation on mortality and rate of disease progression in HIV disease.
Methods: We searched MEDLINE, EMBASE, the Cochrane Central, AMED and CINAHL databases through December 2014, without language restriction, for studies of greater than 3 micronutrients versus any or no comparator. We built a hierarchical Bayesian random-effects model to synthesize results. Inferences are based on the posterior distribution of the population effects; posterior distributions were approximated by Markov chain Monte Carlo in OpenBUGS.
Principal Findings: From 2166 initial references, we selected 49 studies for full review and identified eight reporting on disease progression and/or mortality. Bayesian synthesis of data from 2,249 adults in three studies estimated the relative risk of disease progression in subjects on MNS vs. control as 0.62 (95% credible interval, 0.37, 0.96). The median number needed to treat is 8.4 (4.8, 29.9) and the Bayes factor is 53.4. Based on data from 4,095 adults reporting mortality in 7 randomized controlled studies, the RR was 0.84 (0.38, 1.85) and the NNT is 25 (4.3, ∞).
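The random-effects synthesis described in Methods can be sketched without OpenBUGS by evaluating the posterior for the population log relative risk on a grid; the study estimates, standard errors, and fixed between-study SD below are invented for illustration, not taken from the meta-analysis:

```python
import numpy as np

# Hypothetical study-level log relative risks and standard errors
log_rr = np.array([-0.60, -0.35, -0.50])
se = np.array([0.30, 0.25, 0.40])
tau = 0.20                      # between-study SD, held fixed here

# Flat prior on the population effect mu; posterior evaluated on a grid
mu_grid = np.linspace(-2.0, 1.0, 601)
var = se[:, None] ** 2 + tau ** 2
loglik = -0.5 * np.sum((log_rr[:, None] - mu_grid[None, :]) ** 2 / var,
                       axis=0)
post = np.exp(loglik - loglik.max())
post /= post.sum()

mu_mean = float(np.sum(mu_grid * post))        # posterior mean of log RR
p_benefit = float(np.sum(post[mu_grid < 0]))   # posterior P(RR < 1)
```

A full treatment would also place a prior on tau and sample it; holding it fixed keeps the sketch to a one-dimensional grid.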
Conclusions: MNS significantly and substantially slows disease progression in HIV+ adults not on ARV, and possibly reduces mortality. Micronutrient supplements are effective in reducing progression with a posterior probability of 97.9%. Considering the low cost of MNS and its lack of adverse effects, MNS should be standard of care for HIV+ adults not yet on ARV.Trace elements in nutrition, HIV infections--Treatment, Public health, Health services administration, Medical care, Statisticsmsj2119Human DevelopmentArticlesStatistics of surface divergence and their relation to air-water gas transfer velocity
https://academiccommons.columbia.edu/catalog/ac:194442
Asher, William E.; Liang, Hanzhuang; Zappa, Christopher J.; Loewen, Mark R.; Mukto, Moniz A.; Litchendorf, Trina M.; Jessup, Andrew T.10.7916/D8571BVQFri, 30 Jun 2017 00:40:37 +0000Air-sea gas fluxes are generally defined in terms of the air/water concentration difference of the gas and the gas transfer velocity, kL. Because it is difficult to measure kL in the ocean, it is often parameterized using more easily measured physical properties. Surface divergence theory suggests that infrared (IR) images of the water surface, which contain information concerning the movement of water very near the air-water interface, might be used to estimate kL. Therefore, a series of experiments testing whether IR imagery could provide a convenient means for estimating the surface divergence applicable to air-sea exchange were conducted in a synthetic jet array tank embedded in a wind tunnel. Gas transfer velocities were measured as a function of wind stress and mechanically generated turbulence; laser-induced fluorescence was used to measure the concentration of carbon dioxide in the top 300 μm of the water surface; IR imagery was used to measure the spatial and temporal distribution of the aqueous skin temperature; and particle image velocimetry (PIV) was used to measure turbulence at a depth of 1 cm below the air-water interface. It is shown that an estimate of the surface divergence for both wind-shear driven turbulence and mechanically generated turbulence can be derived from the surface skin temperature. The estimates derived from the IR images are compared to velocity field divergences measured by PIV and to independent estimates of the divergence made using the laser-induced fluorescence data.
Divergence is shown to scale with kL values measured using gaseous tracers as predicted by conceptual models for both wind-driven and mechanically generated turbulence.Ocean-atmosphere interaction, Divergence theorem, Gas flow--Mathematical models, Surface waves (Oceanography), Oceanography, Mathematics, Statisticscjz9Lamont-Doherty Earth ObservatoryArticlesDistributed Bayesian Computation and Self-Organized Learning in Sheets of Spiking Neurons with Local Lateral Inhibition
https://academiccommons.columbia.edu/catalog/ac:192253
Buesing, Lars; Habenschuss, Stefan; Bill, Johannes; Nessler, Bernhard; Maass, Wolfgang; Legenstein, Robert10.7916/D8862G4XThu, 29 Jun 2017 23:25:22 +0000During the last decade, Bayesian probability theory has emerged as a framework in cognitive science and neuroscience for describing perception, reasoning and learning of mammals. However, our understanding of how probabilistic computations could be organized in the brain, and how the observed connectivity structure of cortical microcircuits supports these calculations, is rudimentary at best. In this study, we investigate statistical inference and self-organized learning in a spatially extended spiking network model that accommodates both local competitive and large-scale associative aspects of neural information processing, under a unified Bayesian account. Specifically, we show how the spiking dynamics of a recurrent network with lateral excitation and local inhibition in response to distributed spiking input can be understood as sampling from a variational posterior distribution of a well-defined implicit probabilistic model. This interpretation further permits a rigorous analytical treatment of experience-dependent plasticity on the network level. Using machine learning theory, we derive update rules for neuron and synapse parameters which equate with Hebbian synaptic and homeostatic intrinsic plasticity rules in a neural implementation. In computer simulations, we demonstrate that the interplay of these plasticity rules leads to the emergence of probabilistic local experts that form distributed assemblies of similarly tuned cells communicating through lateral excitatory connections. The resulting sparse distributed spike code of a well-adapted network carries compressed information on salient input features combined with prior experience on correlations among them.
Our theory predicts that the emergence of such efficient representations benefits from network architectures in which the range of local inhibition matches the spatial extent of pyramidal cells that share common afferent input.Neuroplasticity, Neurons, Inhibition, Bayesian statistical decision theory, Neurosciences, Molecular biology, StatisticsStatisticsArticlesThe WTO Dispute Settlement System: 1995-2010 Some Descriptive Statistics
https://academiccommons.columbia.edu/catalog/ac:192343
Mavroidis, Petros C.; Horn, Henrik; Johannesson, Louise10.7916/D8B27TZZThu, 29 Jun 2017 23:15:50 +0000This paper reports descriptive statistics based on the WTO Dispute Settlement Data Set (Ver. 3.0). The data set contains approximately 67,000 observations on a wide range of aspects of the Dispute Settlement (DS) system, and is exclusively based on official WTO documents. It covers all 426 WTO disputes initiated through the official filing of a Request for Consultations from January 1, 1995, until August 11, 2011, and for these disputes it includes events occurring until July 28, 2011. In this paper, however, we will omit data pertaining to 2011 and only consider the full years 1995–2010. In order to shed some light on differences across WTO Members in participation in the DS system, we will divide Members into five groups, as specified in detail in Table 1. Broadly speaking, these groups are: G2 - The European Union (EU), and the United States (US); IND - Other industrialized countries; DEV - Developing countries other than LDC; LDC - Least developed countries; BIC - Brazil, India and China. The EU is taken to be EU-15, since the enlargements came relatively late during the period we cover. For the most part, the choice in this regard makes little difference quantitatively, since most of the 12 countries acceding to the EU in 2004 and 2007 have been relatively inactive in the WTO. The LDC group corresponds to the list of LDCs prepared by the United Nations. A more discretionary line is drawn between IND and DEV. Under IND we have classified OECD Members; the non-OECD Members among the 12 countries that most recently became members of the EU; those that are currently at an advanced stage of their accession negotiations; and countries that are not OECD Members but have a very high per capita income, such as Singapore. The DEV group consists of all countries which do not fit into either of the above mentioned categories, and are not BIC countries either.
BIC refers to Brazil, India, and China: the sheer number of cases in which Brazil, India and China have participated, as well as their overall participation in the WTO, led us to treat these three countries as a separate group. The paper is structured as follows: Section 2 highlights the evolution of the total use of the DS system; Section 3 discusses some aspects of participation of the groups defined above when acting as complainants or respondents; Section 4 deals with the subject-matter of disputes; Section 5 highlights a few aspects of countries’ success with regard to the legal claims they made before panels; Section 6 provides information as to the nationality and the appointment process of WTO panelists; Section 7 focuses on the duration of dispute settlement procedures at different stages of the adjudication process; Section 8 concludes.Dispute resolution (Law), Statistics, Developed countries, Developing countries, Law, International law, World Trade Organization, European Unionpm2030LawArticlesIdentification and Validation of Structures in Neural Population Responses
https://academiccommons.columbia.edu/catalog/ac:b2rbnzs7j6
Elsayed, Gamaleldin Fathy10.7916/D8G73S1WThu, 29 Jun 2017 19:21:36 +0000A fundamental challenge of neuroscience is to understand how interconnected populations of neurons give rise to the remarkable computational abilities of our brains. Large neural datasets offer promise, but they are perilous: they are too complex to be studied with traditional single-neuron analyses, and thus require new analyses that can uncover structure at the level of the population. However, since these analyses operate on large datasets, our intuition whether structure is significant breaks down. Hence, we run the risk of over-interpreting structure from the population data that may have a simple explanation. Thus, with population analysis methods, there is also a need for methods that can validate the significance of structure identified. In this dissertation, I discuss topics covering both the identification and the validation of structure in population data. In the first part, I discuss novel methods for uncovering the computational strategy employed by the motor cortex to flexibly switch between different neural computations. I demonstrate that collective activity patterns of motor cortex neurons related to different computations are orthogonal yet can still be linked, indicating a degree of flexibility that was not displayed or predicted by existing cortical models. In the second part, I discuss a novel analytical framework to rigorously test the novelty of population-level findings, given a specified set of primary features such as correlations across time, neurons and experimental conditions. This framework provides a general tool for validating population findings across the brain and across population-level hypotheses.Neurosciences, Electrical engineering, Statistics, Neuronsgfa2109Neurobiology and BehaviorThesesGLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks
https://academiccommons.columbia.edu/catalog/ac:185410
He, Ran; Zheng, Tian10.7916/D8S46QVQThu, 29 Jun 2017 03:44:14 +0000Large networks, as a form of big data, have received an increasing amount of attention in data science, especially large social networks, which are reaching sizes of hundreds of millions of nodes, with daily interactions on the scale of billions. Thus analyzing and modeling these data to understand the connectivities and dynamics of large networks is important in a wide range of scientific fields. Among popular models, exponential random graph models (ERGMs) have been developed to study these complex networks by directly modeling network structures and features. ERGMs, however, are hard to scale to large networks because maximum likelihood estimation of parameters in these models can be very difficult, due to the unknown normalizing constant. Alternative strategies based on Markov chain Monte Carlo (MCMC) draw samples to approximate the likelihood, which is then maximized to obtain the maximum likelihood estimators (MLE). These strategies have poor convergence due to model degeneracy issues and cannot be used on large networks. Chatterjee et al. (Ann Stat 41:2428–2461, 2013) propose a new theoretical framework for estimating the parameters of ERGMs by approximating the normalizing constant using the emerging tools in graph theory—graph limits. In this paper, we construct a complete computational procedure built upon their results with practical innovations; it is fast and able to scale to large networks. More specifically, we evaluate the likelihood via simple function approximation of the corresponding ERGM’s graph limit and iteratively maximize the likelihood to obtain the MLE. We also discuss the methods of conducting likelihood ratio test for ERGMs as well as related issues. Through simulation studies and real data analysis of two large social networks, we show that our new method outperforms the MCMC-based method, especially when the network size is large (more than 100 nodes).
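The intractable normalizing constant the abstract refers to is easiest to see in the one ERGM where it is tractable: the edges-only model, whose constant factorizes over dyads. A minimal sketch of that special case (not the paper's graph-limit method; the node and edge counts are made up):

```python
import math

def edge_ergm_loglik(theta, n_nodes, n_edges):
    """Log-likelihood of the edges-only ERGM. Here the normalizing
    constant has the closed form Z(theta) = (1 + e^theta)^C(n,2);
    for richer statistics (triangles, stars) no such form exists,
    which is the problem graph-limit approximations address."""
    n_dyads = n_nodes * (n_nodes - 1) // 2
    return theta * n_edges - n_dyads * math.log1p(math.exp(theta))

# The MLE in this special case is the log-odds of the observed density
n_nodes, n_edges = 50, 245
n_dyads = n_nodes * (n_nodes - 1) // 2
theta_hat = math.log(n_edges / (n_dyads - n_edges))
```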
One limitation of our approach, inherited from the limitation of the result of Chatterjee et al. (Ann Stat 41:2428–2461, 2013), is that it works only for sequences of graphs with a positive limiting density, i.e., dense graphs.Statisticsrh2528, tz33StatisticsArticlesEffect of Childhood Victimization on Occupational Prestige and Income Trajectories
https://academiccommons.columbia.edu/catalog/ac:184166
Fernandez, Cristina A.; Christ, Sharon L.; LeBlanc, William G.; McCollister, Kathyrn E.; Arheart, Kristopher L.; Dietz, Noella A.; Fleming, Lora E.; Muntaner, Carles; Muennig, Peter A.; Lee, David J.10.7916/D88C9V3DThu, 29 Jun 2017 03:43:09 +0000Background
Violence toward children (childhood victimization) is a major public health problem, with long-term consequences on economic well-being. The purpose of this study was to determine whether childhood victimization affects occupational prestige and income in young adulthood. We hypothesized that young adults who experienced more childhood victimizations would have less prestigious jobs and lower incomes relative to those with no victimization history. We also explored the pathways by which childhood victimization mediates the relationships between background variables, such as parents’ education, and the socioeconomic transition into adulthood.
Methods
A nationally representative sample of 8,901 young adults aged 18–28, surveyed between 1999 and 2009 in the National Longitudinal Survey of Youth 1997 (NLSY), was analyzed. Covariate-adjusted multivariate linear regression and path models were used to estimate the effects of victimization and covariates on income and prestige levels and on income and prestige trajectories. After each participant turned 18, their annual 2002 Census job code was assigned a yearly prestige score based on the 1989 General Social Survey, and their annual income was calculated via self-reports. Occupational prestige and annual income are time-varying variables measured from 1999 to 2009. Victimization effects were tested for moderation by sex, race, and ethnicity in the multivariate models.
Results
Approximately half of our sample reported at least one instance of childhood victimization before the age of 18. Major findings include 1) childhood victimization resulted in slower income and prestige growth over time, and 2) mediation analyses suggested that this slower growth in prestige and earnings arose because victims did not attain the same level of education as non-victims.
Conclusions
Results indicated that the consequences of victimization negatively affected economic success throughout young adulthood, primarily by slowing the growth in prosperity due to lower education levels.Public health, Sociology, Statisticspm124Health Policy and ManagementArticlesDrinking Patterns and Alcohol Use Disorders in São Paulo, Brazil: The Role of Neighborhood Social Deprivation and Socioeconomic Status
https://academiccommons.columbia.edu/catalog/ac:184775
Silveira, Camila Magalhaes; Siu, Erica Rosanna; Anthony, James C.; Saito, Luis Paulo; Guerra de Andrade, Arthur; Kutschenko, Andressa; Viana, Maria Carmen; Wang, Yuan-Pang; Martins, Silvia S.; Andrade, Laura Helena10.7916/D89C6W9PThu, 29 Jun 2017 03:42:00 +0000Background
Research conducted in high-income countries has investigated influences of socioeconomic inequalities on drinking outcomes such as alcohol use disorders (AUD); however, associations of area-level neighborhood social deprivation (NSD) and individual socioeconomic status with these outcomes have not been explored in Brazil. Thus, we investigated the role of these factors on drink-related outcomes in a Brazilian population, attending to male-female variations.
Methods
A multi-stage area probability sample of adult household residents in the São Paulo Metropolitan Area was assessed using the WHO Composite International Diagnostic Interview (WMH-CIDI) (n = 5,037). Estimation focused on prevalence and correlates of past-year alcohol disturbances [heavy drinking of lower frequency (HDLF), heavy drinking of higher frequency (HDHF), abuse, dependence, and DSM-5 AUD] among regular users (RU); odds ratios (OR) were obtained.
Results
Higher NSD, measured as an area-level variable with individual level variables held constant, showed an excess odds for most alcohol disturbances analyzed. Prevalence estimates for HDLF and HDHF among RU were 9% and 20%, respectively, with excess odds in higher NSD areas; schooling (inverse association) and low income were associated with male HDLF. The only individual-level association with female HDLF involved employment status. Prevalence estimates for abuse, dependence, and DSM-5 AUD among RU were 8%, 4%, and 8%, respectively, with excess odds of: dependence in higher NSD areas for males; abuse and AUD for females. Among RU, AUD was associated with unemployment, and low education with dependence and AUD.Social sciences--Research, Statistics, Public healthssm2183EpidemiologyArticlesDynamical Phenotyping: Using Temporal Analysis of Clinically Collected Physiologic Data to Stratify Populations
https://academiccommons.columbia.edu/catalog/ac:184147
Albers, David J.; Elhadad, Noemie; Tabak, E.; Perotte, Adler; Hripcsak, George M.10.7916/D8W9581VThu, 29 Jun 2017 03:41:56 +0000Using glucose time series data from a well measured population drawn from an electronic health record (EHR) repository, the variation in predictability of glucose values quantified by the time-delayed mutual information (TDMI) was explained using a mechanistic endocrine model and manual and automated review of written patient records. The results suggest that predictability of glucose varies with health state where the relationship (e.g., linear or inverse) depends on the source of the acuity. It was found that on a fine scale in parameter variation, the less insulin required to process glucose, a condition that correlates with good health, the more predictable glucose values were. Nevertheless, the most powerful effect on predictability in the EHR subpopulation was the presence or absence of variation in health state, specifically, in- and out-of-control glucose versus in-control glucose. Both of these results are clinically and scientifically relevant because the magnitude of glucose is the most commonly used indicator of health as opposed to glucose dynamics, thus providing for a connection between a mechanistic endocrine model and direct insight to human health via clinically collected data.Medicine, Endocrinology, Statisticsdja2119, ne60, ajp2120, gh13Biomedical InformaticsArticlesStatistical Searches for Microlensing Events in Large, Non-Uniformly Sampled Time-Domain Surveys: A Test Using Palomar Transient Factory Data
https://academiccommons.columbia.edu/catalog/ac:185413
Price-Whelan, Adrian Michael; Agüeros, Marcel Andre; Fournier, Amanda P.; Street, Rachel; Ofek, Eran O.; Covey, Kevin R.; Levitan, David; Laher, Russ R.; Sesar, Branimir; Surace, Jason10.7916/D8HM57CMThu, 29 Jun 2017 03:41:47 +0000Many photometric time-domain surveys are driven by specific goals, such as searches for supernovae or transiting exoplanets, which set the cadence with which fields are re-imaged. In the case of the Palomar Transient Factory (PTF), several sub-surveys are conducted in parallel, leading to non-uniform sampling over its ~20,000 deg² footprint. While the median 7.26 deg² PTF field has been imaged ~40 times in the R band, ~2300 deg² have been observed >100 times. We use PTF data to study the trade-off between searching for microlensing events in a survey whose footprint is much larger than that of typical microlensing searches, but with far-from-optimal time sampling. To examine the probability that microlensing events can be recovered in these data, we test statistics used on uniformly sampled data to identify variables and transients. We find that the von Neumann ratio performs best for identifying simulated microlensing events in our data. We develop a selection method using this statistic and apply it to data from fields with >10 R-band observations, 1.1 × 10⁹ light curves, uncovering three candidate microlensing events. We lack simultaneous, multi-color photometry to confirm these as microlensing events. However, their number is consistent with predictions for the event rate in the PTF footprint over the survey's three years of operations, as estimated from near-field microlensing models.
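The von Neumann ratio used for candidate selection is simple: the mean squared successive difference divided by the variance. It sits near 2 for uncorrelated scatter and well below 2 for a smooth, correlated brightening such as a microlensing event. A minimal sketch on simulated light curves (not PTF data):

```python
import numpy as np

def von_neumann_ratio(x):
    """Mean squared successive difference over the variance: ~2 for
    uncorrelated scatter, well below 2 for smooth, correlated
    variation such as a microlensing brightening."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.diff(x) ** 2) / np.var(x))

# Simulated light curves: a smooth microlensing-like bump vs. pure noise
t = np.linspace(-3.0, 3.0, 200)
bump = 1.0 + np.exp(-t ** 2)            # smooth, correlated signal
rng = np.random.default_rng(0)
scatter = rng.standard_normal(200)      # uncorrelated scatter

eta_bump = von_neumann_ratio(bump)       # far below 2
eta_scatter = von_neumann_ratio(scatter) # close to 2
```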
This work can help constrain all-sky event rate predictions and tests microlensing signal recovery in large data sets, which will be useful to future time-domain surveys, such as that planned with the Large Synoptic Survey Telescope.Astronomy, Statisticsamp2217, maa17AstronomyArticlesSurveying Hard-to-Reach Groups Through Sampled Respondents in a Social Network
https://academiccommons.columbia.edu/catalog/ac:185373
McCormick, Tyler H.; Zheng, Tian; He, Ran; Kolaczyk, Eric10.7916/D8Z0372NThu, 29 Jun 2017 03:41:08 +0000The sampling frame in most social science surveys misses members of certain groups, such as the homeless or individuals living with HIV. These groups are known as hard-to-reach groups. One strategy for learning about these groups, or subpopulations, involves reaching hard-to-reach group members through their social network. In this paper we compare the efficiency of two common methods for subpopulation size estimation using data from standard surveys. These designs are examples of mental link tracing designs. These designs begin with a randomly sampled set of network members (nodes) and then reach other nodes indirectly through questions asked to the sampled nodes. Mental link tracing designs cost significantly less than traditional link tracing designs, yet introduce additional sources of potential bias. We examine the influence of one such source of bias using simulation studies. We then demonstrate our findings using data from the General Social Survey collected in 2004 and 2006. Additionally, we provide survey design suggestions for future surveys incorporating such designs.Statistics, Social sciences--Researchthm2105, tz33, rh2528StatisticsArticlesLatent demographic profile estimation in hard-to-reach groups
https://academiccommons.columbia.edu/catalog/ac:184956
McCormick, Tyler H.; Zheng, Tian10.7916/D8F76BFQThu, 29 Jun 2017 03:41:07 +0000The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using Aggregated relational data (ARD), or questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28–39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.Statisticsthm2105, tz33StatisticsArticlesA Practical Guide to Measuring Social Structure Using Indirectly Observed Network Data
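The ARD models above build on the network scale-up idea: estimate each respondent's network size from "How many X's do you know?" counts for groups of known size, then scale counts for the hidden group. A back-of-the-envelope Killworth-style version, far simpler than the hierarchical model in the paper and using entirely synthetic numbers:

```python
import numpy as np

def scale_up_estimate(known_counts, known_sizes, hidden_counts, population):
    """Killworth-style network scale-up: degrees are estimated from
    counts of known-size groups, then the hidden-group counts are
    scaled by total degree. A simplification of the ARD models in
    the paper (no measurement error, no barrier effects)."""
    known_counts = np.asarray(known_counts, dtype=float)
    degrees = known_counts.sum(axis=1) / np.sum(known_sizes) * population
    return float(np.sum(hidden_counts) / degrees.sum() * population)

# Synthetic example: two respondents, two known groups, one hidden group
pop = 1_000_000
known_sizes = [10_000, 20_000]
known_counts = [[5.0, 10.0],   # consistent with a degree of ~500
                [3.0, 6.0]]    # consistent with a degree of ~300
hidden_counts = [2.5, 1.5]
est = scale_up_estimate(known_counts, known_sizes, hidden_counts, pop)
```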
https://academiccommons.columbia.edu/catalog/ac:185370
McCormick, Tyler H.; Moussa, Amal; DiPrete, Thomas A.; Ruf, Johannes; Gelman, Andrew E.; Teitler, Julien O.; Zheng, Tian10.7916/D86H4G9DThu, 29 Jun 2017 03:41:05 +0000Aggregated relational data (ARD) are an increasingly common tool for learning about social networks through standard surveys. Recent statistical advances present social scientists with new options for analyzing such data. In this article, we propose guidelines for learning about various network processes using ARD and a template to aid practitioners. We first propose that ARD can be used to measure “social distance” between a respondent and a subpopulation (individuals named Kevin, those in prison, or those serving in the military). We then present common methods for analyzing these data and associate each of these methods with a specific way of measuring social distance, thus associating statistical tools with their underlying social science phenomena. We examine the implications of using each of these social distance measures using an Internet survey about contemporary political issues.Statistics, Social sciences--Researchthm2105, am2810, tad61, ag389, jot8, tz33Sociology, Statistics, Social WorkArticlesBiodiversity and Ecosystem Multi-Functionality: Observed Relationships in Smallholder Fallows in Western Kenya
https://academiccommons.columbia.edu/catalog/ac:184813
Sircely, Jason; Naeem, Shahid10.7916/D8V986XHThu, 29 Jun 2017 03:40:59 +0000Recent studies indicate that species richness can enhance the ability of plant assemblages to support multiple ecosystem functions. To understand how and when ecosystem services depend on biodiversity, it is valuable to expand beyond experimental grasslands. We examined whether plant diversity improves the capacity of agroecosystems to sustain multiple ecosystem services—production of wood and forage, and two elements of soil formation—in two types of smallholder fallows in western Kenya. In 18 grazed and 21 improved fallows, we estimated biomass and quantified soil organic carbon, soil base cations, sand content, and soil infiltration capacity. For four ecosystem functions (wood biomass, forage biomass, soil base cations, steady infiltration rates) linked to the focal ecosystem services, we quantified ecosystem service multi-functionality as (1) the proportion of functions above half-maximum, and (2) mean percentage excess above mean function values, and assessed whether plant diversity or environmental favorability better predicted multi-functionality. In grazed fallows, positive effects of plant diversity best explained the proportion above half-maximum and mean percentage excess, the former also declining with grazing intensity. In improved fallows, the proportion above half-maximum was not associated with soil carbon or plant diversity, while soil carbon predicted mean percentage excess better than diversity. Grazed fallows yielded stronger evidence for diversity effects on multi-functionality, while environmental conditions appeared more influential in improved fallows. The contrast in diversity-multi-functionality relationships among fallow types appears related to differences in management and associated factors including disturbance and species composition. 
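The two multi-functionality indices defined above are direct to compute from per-plot function values. A sketch with made-up numbers for the four focal functions (the plot values, maxima, and means below are illustrative, not the Kenyan data):

```python
import numpy as np

def prop_above_half_max(plot_values, function_maxima):
    """Index (1): proportion of ecosystem functions whose value in
    this plot is at least half the across-plot maximum."""
    plot_values = np.asarray(plot_values, dtype=float)
    function_maxima = np.asarray(function_maxima, dtype=float)
    return float(np.mean(plot_values >= 0.5 * function_maxima))

def mean_pct_excess(plot_values, function_means):
    """Index (2): mean percentage excess of each function over its
    across-plot mean."""
    plot_values = np.asarray(plot_values, dtype=float)
    function_means = np.asarray(function_means, dtype=float)
    return float(np.mean((plot_values - function_means) / function_means) * 100)

# Hypothetical values for wood biomass, forage biomass, soil base
# cations, and steady infiltration rate in one fallow plot
plot = [8.0, 2.0, 30.0, 1.2]
maxima = [10.0, 6.0, 40.0, 2.0]   # across-plot maxima
means = [5.0, 2.5, 25.0, 1.0]     # across-plot means
```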
Complementary effects of species with contrasting functional traits on different functions and multi-functional species may have contributed to diversity effects in grazed fallows. Biodiversity and environmental favorability may enhance the capacity of smallholder fallows to simultaneously provide multiple ecosystem services, yet their effects are likely to vary with fallow management.Ecology, Statisticsjas2162, sn2121Ecology, Evolution, and Environmental BiologyArticlesPopulation Physiology: Leveraging Electronic Health Record Data to Understand Human Endocrine Dynamics
https://academiccommons.columbia.edu/catalog/ac:184150
Albers, David J.; Hripcsak, George M.; Schmidt, J. Michael10.7916/D8KW5DWSThu, 29 Jun 2017 03:40:36 +0000Studying physiology and pathophysiology over a broad population for long periods of time is difficult primarily because collecting human physiologic data can be intrusive, dangerous, and expensive. One solution is to use data that have been collected for a different purpose. Electronic health record (EHR) data promise to support the development and testing of mechanistic physiologic models on diverse populations and allow correlation with clinical outcomes, but limitations in the data have thus far thwarted such use. For example, using uncontrolled population-scale EHR data to verify the outcome of time-dependent behavior of mechanistic, constructive models can be difficult because: (i) aggregation of the population can obscure or generate a signal, (ii) there is often no control population with a well understood health state, and (iii) diversity in how the population is measured can make the data difficult to fit into conventional analysis techniques. This paper shows that it is possible to use EHR data to test a physiological model for a population and over long time scales. Specifically, a methodology is developed and demonstrated for testing a mechanistic, time-dependent, physiological model of serum glucose dynamics with uncontrolled, population-scale, physiological patient data extracted from an EHR repository. It is shown that there is no observable daily variation in the normalized mean glucose for any EHR subpopulations. In contrast, a derived value, daily variation in nonlinear correlation quantified by the time-delayed mutual information (TDMI), did reveal the intuitively expected diurnal variation in glucose levels amongst a random population of humans. Moreover, in a population of continuously (tube) fed patients, there was no observable TDMI-based diurnal signal.
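The TDMI statistic at the center of this analysis can be sketched with a plain histogram estimator, crude and bias-prone but enough to show why a diurnal rhythm raises TDMI at the matching lag. The series below are simulated, not EHR data:

```python
import numpy as np

def tdmi(x, lag, bins=16):
    """Time-delayed mutual information I(x_t; x_{t+lag}) in nats,
    estimated with a simple 2-D histogram (a rough estimator,
    fine for illustrating the idea)."""
    x = np.asarray(x, dtype=float)
    a, b = x[:-lag], x[lag:]
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

# A noisy 24-step "diurnal" series vs. pure noise, compared at lag 12
rng = np.random.default_rng(1)
t = np.arange(5000)
periodic = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(5000)
noise = rng.standard_normal(5000)

mi_periodic = tdmi(periodic, lag=12)  # large: strong nonlinear correlation
mi_noise = tdmi(noise, lag=12)        # near zero (small histogram bias)
```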
These TDMI-based signals, via a glucose insulin model, were then connected with human feeding patterns. In particular, a constructive physiological model was shown to correctly predict the difference between the general uncontrolled population and a subpopulation whose feeding was controlled.Statistics, Medicinedja2119, gh13, mjs2134Biomedical Informatics, NeurologyArticlesPolymorphisms in the Mitochondrial DNA Control Region and Frailty in Older Adults
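The TDMI diagnostic described in the abstract above can be sketched in a few lines. This is a minimal illustration with simulated hourly glucose-like series and a simple 2-D histogram estimator of mutual information; the function, the data, and the bin settings are assumptions for exposition, not the authors' implementation.

```python
import math
import random
from collections import Counter

def tdmi(series, lag, bins=8):
    """Time-delayed mutual information I(x_t ; x_{t+lag}) in nats,
    estimated from a 2-D histogram of lagged pairs."""
    pairs = list(zip(series[:-lag], series[lag:]))
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0
    cell = lambda v: min(int((v - lo) / width), bins - 1)
    joint = Counter((cell(a), cell(b)) for a, b in pairs)
    n = len(pairs)
    px, py = Counter(), Counter()
    for (a, b), c in joint.items():
        px[a] += c
        py[b] += c
    return sum(c / n * math.log((c / n) / (px[a] / n * py[b] / n))
               for (a, b), c in joint.items())

random.seed(1)
hours = range(24 * 90)                                  # 90 days, hourly
diurnal = [100 + 20 * math.sin(2 * math.pi * t / 24)
           + random.gauss(0, 2) for t in hours]         # diurnal rhythm
flat = [100 + random.gauss(0, 2) for _ in hours]        # e.g., tube-fed

# The periodic series carries far more information at the 24-hour lag,
# mirroring the diurnal signal the paper recovers from EHR data.
print(tdmi(diurnal, 24) > tdmi(flat, 24))
```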
https://academiccommons.columbia.edu/catalog/ac:184807
Moore, Anne Z.; Biggs, Mary L.; O'Connor, Ashley; Matteini, Amy; McGuire, Sarah; Beamer, Brock A.; Fallin, M. Danielle; Walston, Jeremy; Fried, Linda P.; Chakravarti, Aravinda; Arking, Dan E.10.7916/D83R0RRHThu, 29 Jun 2017 03:40:06 +0000Background:
Mitochondria contribute to the dynamics of cellular metabolism, the production of reactive oxygen species, and apoptotic pathways. Consequently, mitochondrial function has been hypothesized to influence functional decline and vulnerability to disease in later life. Mitochondrial genetic variation may contribute to altered susceptibility to the frailty syndrome in older adults.
Methodology/Principal Findings:
To assess potential mitochondrial genetic contributions to the likelihood of frailty, mitochondrial DNA (mtDNA) variation was compared in frail and non-frail older adults. Associations of selected SNPs with a muscle strength phenotype were also explored. Participants were selected from the Cardiovascular Health Study (CHS), a population-based observational study (1989–1990, 1992–1993). At baseline, frailty was identified as the presence of three or more of five indicators (weakness, slowness, shrinking, low physical activity, and exhaustion). mtDNA variation was assessed in a pilot study, including 315 individuals selected as extremes of the frailty phenotype, using an oligonucleotide sequencing microarray based on the Revised Cambridge Reference Sequence. Three mtDNA SNPs were statistically significantly associated with frailty across all pilot participants or in sex-stratified comparisons: mt146, mt204, and mt228. In addition to pilot participants, 4,459 additional men and women with frailty classifications, and an overlapping subset of 4,453 individuals with grip strength measurements, were included in the study population genotyped at mt204 and mt228. In the study population, the mt204 C allele was associated with greater likelihood of frailty (adjusted odds ratio = 2.04, 95% CI = 1.07–3.60, p = 0.020) and lower grip strength (adjusted coefficient = −2.04, 95% CI = −3.33– −0.74, p = 0.002).
Conclusions:
This study supports a role for mitochondrial genetic variation in the frailty syndrome and later life muscle strength, demonstrating the importance of the mitochondrial genome in complex geriatric phenotypes.Genetics, Medicine, Statisticslf2296EpidemiologyArticlesHow many people do you know?: Efficiently estimating personal network size
https://academiccommons.columbia.edu/catalog/ac:185367
Zheng, Tian; Salganik, Matthew J.; McCormick, Tyler H.10.7916/D8FX78BTThu, 29 Jun 2017 03:40:03 +0000In this paper we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. and other previous attempts to estimate individual network size, we propose a latent non-random mixing model which resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names to be asked about are chosen properly, the simple scale-up degree estimates can enjoy the same bias-reduction as that from our more complex latent non-random mixing model.Statistics, Social sciences--Researchtz33, thm2105StatisticsArticlesOn Bootstrap Tests of Symmetry About an Unknown Median
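The basic Killworth-style scale-up estimator that the abstract above builds on can be written in one line. The respondent counts and subpopulation sizes below are made up for illustration; they are not the McCarty et al. data.

```python
def scale_up_degree(known, subpop_sizes, total_pop):
    """Killworth-style scale-up estimate of personal network size:
    d_i ~= N * sum_k(y_ik) / sum_k(N_k), where y_ik is how many people
    respondent i knows in probe subpopulation k of size N_k."""
    return total_pop * sum(known) / sum(subpop_sizes)

# Hypothetical respondent: knows 2 Michaels and 1 Nicole, with made-up
# subpopulation sizes, in a population of 300 million.
d = scale_up_degree([2, 1], [4_000_000, 2_000_000], 300_000_000)
print(d)  # 150.0
```

The latent non-random mixing model of the paper corrects this simple ratio for the fact that acquaintances are not formed uniformly at random across groups.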
https://academiccommons.columbia.edu/catalog/ac:184965
Zheng, Tian; Gastwirth, Joseph L.10.7916/D8X9296PThu, 29 Jun 2017 03:40:03 +0000It is important to examine the symmetry of an underlying distribution before applying some statistical procedures to a data set. For example, in the Zuni School District case, a formula originally developed by the Department of Education trimmed 5% of the data symmetrically from each end. The validity of this procedure was questioned at the hearing by Chief Justice Roberts. Most tests of symmetry (even nonparametric ones) are not distribution free in finite sample sizes. Hence, relying on the asymptotic distribution may not yield an accurate type I error rate and may entail a loss of power in small samples. Bootstrap resampling from a symmetric empirical distribution function fitted to the data is proposed to improve the accuracy of the calculated p-value of several tests of symmetry. The results show that the bootstrap method is superior to previously used approaches relying on the asymptotic distribution of the tests that assumed the data come from a normal distribution. Incorporating the bootstrap estimate in a recently proposed test due to Miao, Gel and Gastwirth (2006) preserves its level and shows that it has reasonable power properties on the family of distributions evaluated.Statisticstz33StatisticsArticlesDiscovering influential variables: A method of partitions
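The resampling scheme in the abstract above, drawing bootstrap samples from an empirical distribution symmetrized about the sample median, can be sketched as follows. The asymmetry statistic used here (mean minus median) is a deliberately crude stand-in, not the Miao, Gel and Gastwirth statistic.

```python
import random
import statistics

def boot_symmetry_pvalue(x, stat, B=999, seed=7):
    """Bootstrap p-value for H0: the distribution is symmetric about its
    (unknown) median. Resamples are drawn from the sample pooled with its
    reflection about the median, which is symmetric by construction."""
    rng = random.Random(seed)
    med = statistics.median(x)
    sym = list(x) + [2 * med - v for v in x]   # symmetrized empirical dist.
    t_obs = abs(stat(x))
    n = len(x)
    hits = sum(abs(stat(rng.choices(sym, k=n))) >= t_obs for _ in range(B))
    return (hits + 1) / (B + 1)

# A crude asymmetry statistic; illustrative only.
skew_stat = lambda v: statistics.mean(v) - statistics.median(v)

right_skewed = [k ** 3 for k in range(1, 31)]   # clearly asymmetric sample
p = boot_symmetry_pvalue(right_skewed, skew_stat)
print(p)
```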
https://academiccommons.columbia.edu/catalog/ac:184953
Chernoff, Herman; Lo, Shaw-Hwa; Zheng, Tian10.7916/D8PR7TVMThu, 29 Jun 2017 03:39:38 +0000A trend in all scientific disciplines, based on advances in technology, is the increasing availability of high dimensional data in which important information lies buried. A current urgent challenge to statisticians is to develop effective methods of finding the useful information from the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer intensive approach, based on a method pioneered by Lo and Zheng for detecting which, of many potential explanatory variables, have an influence on a dependent variable Y. This approach is suited to detect influential variables, where causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis, involving possibly thousands of variables, by dealing with many randomly selected small subsets from which smaller subsets are selected, guided by a measure of influence I. The main objective is to discover the influential variables, rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be amenable to appropriate analysis. In a sense, we are confining our attention to locating a few needles in a haystack.Computer science, Statisticsshl5, tz33StatisticsArticlesSigns of the 2009 Influenza Pandemic in the New York-Presbyterian Hospital Electronic Health Records
https://academiccommons.columbia.edu/catalog/ac:184153
Khiabanian, Hossein; Holmes, Antony B.; Kelly, Brendan J.; Gururaj, Mrinalini; Hripcsak, George M.; Rabadan, Raul10.7916/D82V2F0DThu, 29 Jun 2017 03:39:38 +0000Background
In June of 2009, the World Health Organization declared the first influenza pandemic of the 21st century, and by July, New York City's New York-Presbyterian Hospital (NYPH) experienced a heavy burden of cases, attributable to a novel strain of the virus (H1N1pdm).
Methods and Results
We present the signs in the NYPH electronic health records (EHR) that distinguished the 2009 pandemic from previous seasonal influenza outbreaks via various statistical analyses. These signs include (1) an increase in the number of patients diagnosed with influenza, (2) a preponderance of influenza diagnoses outside of the normal flu season, and (3) marked vaccine failure. The NYPH EHR also reveals distinct age distributions of patients affected by seasonal influenza and the pandemic strain, and via available longitudinal data, suggests that the two may be associated with distinct sets of comorbid conditions as well. In particular, we find significantly more pandemic flu patients with diagnoses associated with asthma and underlying lung disease. We further observe that the NYPH EHR is capable of tracking diseases at a resolution as high as particular zip codes in New York City.
Conclusion
The NYPH EHR permits early detection of pandemic influenza and hypothesis generation via identification of those significantly associated illnesses. As data standards develop and databases expand, EHRs will contribute more and more to disease detection and the discovery of novel disease associations.Medicine, Statistics, Public healthhk2524, abh2138, gh13, rr2579Biomedical InformaticsArticlesComment: Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies
https://academiccommons.columbia.edu/catalog/ac:184983
Zheng, Tian; Lo, Shaw-Hwa10.7916/D84T6H8MThu, 29 Jun 2017 03:39:16 +0000The authors suggest an interesting way to measure
the fraction of missing information in the context of
hypothesis testing. The measure seeks to quantify the
impact of missing observations on the test between two
hypotheses. The amount of impact can be useful information
for applied research. An example is, in genetics,
where multiple tests of the same sort are performed
on different variables with different missing rates, and
follow-up studies may be designed to resolve missing
values in selected variables.
In this discussion, we offer our prospective views on
the use of relative information in a follow-up study.
For studies where the impact of missing observations
varies greatly across different variables and where the
investigators have the flexibility of designing studies
that can have different efforts on variables, an optimal
design may be derived using relative information measures
to improve the cost-effectiveness of the followup.Statisticstz33, shl5StatisticsArticlesHow Many People Do You Know in Prison? Using Overdispersion in Count Data to Estimate Social Structure in Networks
https://academiccommons.columbia.edu/catalog/ac:185364
Zheng, Tian; Salganik, Matthew J.; Gelman, Andrew E.10.7916/D800011WThu, 29 Jun 2017 03:38:21 +0000Networks—sets of objects connected by relationships—are important in a number of fields. The study of networks has long been central to sociology, where researchers have attempted to understand the causes and consequences of the structure of relationships in large groups of people. Using insight from previous network research, Killworth et al. and McCarty et al. have developed and evaluated a method for estimating the sizes of hard-to-count populations using network data collected from a simple random sample of Americans. In this article we show how, using a multilevel overdispersed Poisson regression model, these data also can be used to estimate aspects of social structure in the population. Our work goes beyond most previous research on networks by using variation, as well as average responses, as a source of information. We apply our method to the data of McCarty et al. and find that Americans vary greatly in their number of acquaintances. Further, Americans show great variation in propensity to form ties to people in some groups (e.g., males in prison, the homeless, and American Indians), but little variation for other groups (e.g., twins, people named Michael or Nicole). We also explore other features of these data and consider ways in which survey data can be used to estimate network structure.Statistics, Social sciences--Researchtz33, ag389Statistics, Political ScienceArticlesBackward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs
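The overdispersion idea in the abstract above has a simple diagnostic counterpart: for Poisson counts the variance equals the mean, so a variance-to-mean ratio well above 1 signals extra variation in the propensity to know members of a group. The responses below are invented for illustration (the full paper fits a multilevel overdispersed Poisson regression, not this raw ratio).

```python
import statistics

def overdispersion(counts):
    """Variance-to-mean ratio of "how many X do you know?" responses.
    Equals ~1 for Poisson counts; values well above 1 indicate
    overdispersion, i.e., heterogeneity in propensity to know the group."""
    m = statistics.mean(counts)
    return statistics.variance(counts) / m

# Hypothetical survey responses (not the McCarty et al. data):
prison = [0, 0, 0, 0, 0, 1, 0, 0, 6, 0]   # a few respondents know many
twins  = [1, 0, 2, 1, 1, 0, 1, 2, 1, 1]   # everyone knows about the same

print(overdispersion(prison) > 1 > overdispersion(twins))
```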
https://academiccommons.columbia.edu/catalog/ac:185325
Zheng, Tian; Wang, Hui; Lo, Shaw-Hwa10.7916/D8SF2V33Thu, 29 Jun 2017 03:38:20 +0000Background: The study of complex traits poses new challenges to current methods that evaluate association between genotypes and a specific trait. Consideration of possible interactions among loci leads to overwhelming dimensions that cannot be handled using current statistical methods. Methods: In this article, we evaluate a multi-marker screening algorithm--the backward genotype-trait association (BGTA) algorithm for case-control designs, which uses unphased multi-locus genotypes. BGTA carries out a global investigation of a candidate marker set and automatically screens out markers carrying diminutive amounts of information regarding the trait in question. To address the "too many possible genotypes, too few informative chromosomes" dilemma of a genomic-scale study that consists of hundreds to thousands of markers, we further investigate a BGTA-based marker selection procedure, in which the screening algorithm is repeated on a large number of random marker subsets. Results of these screenings are then aggregated into counts of how often each marker is retained by the BGTA algorithm. Markers with exceptionally high counts are selected for further analysis. Results and Conclusion: Evaluated using simulations under several disease models, the proposed methods prove to be more powerful in dealing with epistatic traits. We also demonstrate the proposed methods through an application to a study of inflammatory bowel disease.Statistics, Genetics, Biometrytz33, hw2334, shl5Statistics, Microbiology and Immunology, BiostatisticsArticlesEstimating Preferences under Risk: The Case of Racetrack Bettors
https://academiccommons.columbia.edu/catalog/ac:184178
Jullien, Bruno; Salanie, Bernard10.7916/D8S75F6JThu, 29 Jun 2017 03:36:35 +0000In this paper we investigate the attitudes toward risk of bettors in British horse races. The model we use allows us to go beyond the expected utility framework and to explore various alternative proposals by estimating a multinomial model on a 34,443‐race data set. We find that rank‐dependent utility models do not fit the data noticeably better than expected utility models. On the other hand, cumulative prospect theory has higher explanatory power. Our preferred estimates suggest a pattern of local risk aversion similar to that proposed by Friedman and Savage.Economics, Statisticsbs2237EconomicsArticlesPrélèvements et transferts sociaux: une analyse descriptive des incitations financières au travail
https://academiccommons.columbia.edu/catalog/ac:184353
Laroque, Guy; Salanie, Bernard10.7916/D88914Q2Thu, 29 Jun 2017 03:36:21 +0000A complex set of taxes and social transfers stands between the pay households receive and the income they actually have at their disposal. On one side, social contributions and taxes reduce this income; on the other, social benefits and allowances increase it. But the workings of this system affect a household's disposable income differently depending on the household's characteristics (spouse's situation, number of children) and income level (RMI, low wages). Until now, the system had been described only through the analysis of typical cases. Applying it to a representative sample of part of the French population (nearly 20 million individuals) also makes it possible to study the distribution of net tax rates within this subpopulation.
The simulation exercises show that the households with the lowest incomes face the highest marginal tax rates, which may weaken the financial incentives to return to work. In particular, the financial incentive to take a job paid at the Smic (the French minimum wage) appears weak for many of the unemployed and economically inactive.Economics, Statistics, Labor economics, Social sciences--Researchbs2237EconomicsArticlesPreaching to the Unconverted
https://academiccommons.columbia.edu/catalog/ac:179470
Uriarte, Maria; Yackulic, Charles B.10.7916/D8SB44FMThu, 29 Jun 2017 02:57:01 +0000Rapid advances in computing in the past 20 years have led to an explosion in the development and adoption of new statistical modeling tools (Gelman and Hill 2006, Clark 2007, Bolker 2008, Cressie et al. 2009). These innovations have occurred in parallel with a tremendous increase in the availability of ecological data. The latter has been fueled both by new tools that have facilitated data collection and management efforts (e.g., remote sensing, database management software, and so on) and by increased ease of data sharing through computers and the World Wide Web. The impending implementation of the National Ecological Observatory Network (NEON) will further boost data availability. These rapid advances in the ability of ecologists to collect data have not been matched by application of modern statistical tools. Given the critical questions ecology is facing (e.g., climate change, species extinctions, spread of invasives, irreversible losses of ecosystem services) and the benefits that can be gained from connecting existing data to models in a sophisticated inferential framework (Clark et al. 2001, Pielke and Connant 2003), it is important to understand why this mismatch exists. Such an understanding would point to the issues that must be addressed if ecologists are to make useful inferences from these new data and tools and contribute in substantial ways to management and decision making.Ecology, Statisticsmu2126Ecology, Evolution, and Environmental BiologyArticlesNew insights into old methods for identifying causal rare variants
https://academiccommons.columbia.edu/catalog/ac:195277
Hu, Inchi; Zheng, Tian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Wang, Haitian10.7916/D8J38R1MWed, 28 Jun 2017 21:04:19 +0000The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.Human genetics--Variation, Biometry--Statistical methods, Statistics--Methodology, Biometry, Statisticstz33, ch2526, shl5StatisticsArticlesUsing individual growth model to analyze the change in quality of life from adolescence to adulthood
https://academiccommons.columbia.edu/catalog/ac:192019
Chen, Henian; Cohen, Patricia R.10.7916/D8805135Wed, 28 Jun 2017 21:01:16 +0000Background: The individual growth model is a relatively new statistical technique now widely used to examine the unique trajectories of individuals and groups in repeated measures data. This technique is increasingly used to analyze the changes over time in quality of life (QOL) data. This study examines the change from adolescence to adulthood in physical health as an aspect of QOL as an illustration of the use of this analytic method.
Methods: Employing data from the Children in the Community (CIC) study, a prospective longitudinal investigation, physical health was assessed at mean ages 16, 22, and 33 in 752 persons born between 1965 and 1975.
Results: The analyses using individual growth models show a linear decline in average physical health from age 10 to age 40. Males reported better physical health and declined less per year on average. Time-varying psychiatric disorders accounted for 8.6% of the explained variation in mean physical health, and 6.7% of the explained variation in linear change in physical health. Those with such a disorder reported lower mean physical health and a more rapid decline with age than those without a current psychiatric disorder. The use of SAS PROC MIXED, including syntax and interpretation of output, is provided. Applications of these models, including statistical assumptions, centering issues, and cohort effects, are discussed.
Conclusion: This paper highlights the usefulness of the individual growth model in modeling longitudinal change in QOL variables.Human growth--Mathematical models, Quality of life--Statistical methods, Young adults--Health and hygiene, Medical sciences, Aging, Statisticsprc2Psychiatry, EpidemiologyArticlesBAMarray™: Java software for Bayesian analysis of variance for microarray data
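The first stage of an individual growth analysis like the one above is a per-person line: each subject's repeated measures are summarized by an intercept and a slope, which the mixed model (SAS PROC MIXED in the paper) then treats as random effects. The pure-Python sketch below shows only that first stage, with invented measurements; it is a simplified two-stage approximation, not the paper's model.

```python
def ols_line(ages, values):
    """Least-squares intercept and slope for one person's trajectory."""
    n = len(ages)
    ma = sum(ages) / n
    mv = sum(values) / n
    slope = (sum((a - ma) * (v - mv) for a, v in zip(ages, values))
             / sum((a - ma) ** 2 for a in ages))
    return mv - slope * ma, slope

# Hypothetical person assessed at ages 16, 22, and 33 (the CIC design),
# with declining self-reported physical health scores.
intercept, slope = ols_line([16, 22, 33], [80, 78, 74])
print(slope < 0)  # physical health declines with age for this person
```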
https://academiccommons.columbia.edu/catalog/ac:192099
Ishwaran, Hemant; Rao, J. Sunil; Kogalur, Udaya B.10.7916/D8BR8QNZWed, 28 Jun 2017 21:00:37 +0000Background: DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important example is multigroup data collected over different experimental groups, such as data collected from distinct stages of a disease process. We have developed a method specifically addressing these issues termed Bayesian ANOVA for microarrays (BAM). The BAM approach uses a special inferential regularization known as spike-and-slab shrinkage that provides an optimal balance between total false detections and total false non-detections. This translates into more reproducible differential calls. Spike and slab shrinkage is a form of regularization achieved by using information across all genes and groups simultaneously.
Results: BAMarray™ is a graphically oriented Java-based software package that implements the BAM method for detecting differentially expressing genes in multigroup microarray experiments (up to 256 experimental groups can be analyzed). Drop-down menus allow the user to easily select between different models and to choose various run options. BAMarray™ can also be operated in a fully automated mode with preselected run options. Tuning parameters have been preset at theoretically optimal values freeing the user from such specifications. BAMarray™ provides estimates for gene differential effects and automatically estimates data adaptive, optimal cutoff values for classifying genes into biological patterns of differential activity across experimental groups. A graphical suite is a core feature of the product and includes diagnostic plots for assessing model assumptions and interactive plots that enable tracking of prespecified gene lists to study such things as biological pathway perturbations. The user can zoom in and lasso genes of interest that can then be saved for downstream analyses.
Conclusion: BAMarray™ is user friendly platform independent software that effectively and efficiently implements the BAM methodology. Classifying patterns of differential activity is greatly facilitated by a data adaptive cutoff rule and a graphical suite. BAMarray™ is licensed software freely available to academic institutions. More information can be found at http://www.bamarray.com.DNA microarrays--Data processing, Java (Computer program language), Bioinformatics, Bayesian statistical decision theory, Statistics, Information technologyubk2101StatisticsArticlesPAGE: Parametric Analysis of Gene Set Enrichment
https://academiccommons.columbia.edu/catalog/ac:194039
Kim, Seon-Young; Volsky, David Julian10.7916/D84X568JWed, 28 Jun 2017 20:59:43 +0000Background: Gene set enrichment analysis (GSEA) is a microarray data analysis method that uses predefined gene sets and ranks of genes to identify significant biological changes in microarray data sets. GSEA is especially useful when gene expression changes in a given microarray data set are minimal or moderate.
Results: We developed a modified gene set enrichment analysis method based on a parametric statistical analysis model. Compared with GSEA, the parametric analysis of gene set enrichment (PAGE) detected a larger number of significantly altered gene sets and their p-values were lower than the corresponding p-values calculated by GSEA. Because PAGE uses normal distribution for statistical inference, it requires less computation than GSEA, which needs repeated computation of the permutated data set. PAGE was able to detect significantly changed gene sets from microarray data irrespective of different Affymetrix probe level analysis methods or different microarray platforms. Comparison of two aged muscle microarray data sets at gene set level using PAGE revealed common biological themes better than comparison at individual gene level.
Conclusion: PAGE was statistically more sensitive and required much less computational effort than GSEA, it could identify significantly changed biological themes from microarray data irrespective of analysis methods or microarray platforms, and it was useful in comparison of multiple microarray data sets. We offer PAGE as a useful microarray analysis method.Bioinformatics--Methodology, Statistics, DNA microarrays--Data processing, Bioinformatics, Genetics, Biometrydjv4Pathology and Cell BiologyArticlesMedication-Wide Association Studies
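The parametric core of PAGE described above is a single Z-score: for a gene set of size m with mean score Sm, against a background of gene-level scores with mean mu and standard deviation delta, Z = (Sm - mu) * sqrt(m) / delta, which is approximately standard normal by the central limit theorem. The scores below are invented for illustration.

```python
import math
import statistics

def page_z(all_scores, set_scores):
    """PAGE Z-score: Z = (Sm - mu) * sqrt(m) / delta, where mu and delta
    are the mean and SD of all gene-level scores (e.g., log fold changes),
    and Sm is the mean over the m genes in the set."""
    mu = statistics.mean(all_scores)
    delta = statistics.pstdev(all_scores)
    m = len(set_scores)
    return (statistics.mean(set_scores) - mu) * math.sqrt(m) / delta

# Hypothetical fold-change scores: a 4-gene set sits above the background.
background = [-2, -1, 0, 1, 2] * 20
z = page_z(background, [1, 2, 1, 2])
print(round(z, 2))  # 2.12
```

Because the null distribution is normal, no permutation of the data set is needed, which is why PAGE is cheaper to compute than GSEA.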
https://academiccommons.columbia.edu/catalog/ac:173912
Ryan, P. B.; Madigan, David B.; Stang, P. E.; Schuemie, M. J.; Hripcsak, George M.10.7916/D8PG1PVXWed, 28 Jun 2017 20:28:18 +0000Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges in confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects.Pharmacology, Statistics, Bioinformaticsdm2418, gh13Statistics, Biomedical InformaticsArticlesLearning Theory Analysis for Association Rules and Sequential Event Prediction
https://academiccommons.columbia.edu/catalog/ac:173905
Rudin, Cynthia; Letham, Benjamin; Madigan, David B.10.7916/D82N50C1Wed, 28 Jun 2017 20:28:13 +0000We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction." In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer's online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start" problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence" measure that provides a weaker minimum support condition that has advantages over the strict minimum support. 
The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.Statistics, Artificial intelligencedm2418StatisticsArticlesGenerating Productive Dialogue between Consulting Statisticians and their Clients in the Pharmaceutical and Medical Research Settings
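The "adjusted confidence" idea in the abstract above can be sketched directly: instead of a hard minimum-support cutoff, add a constant K to the denominator of the confidence ratio so that poorly supported rules are shrunk. This is a common formulation of the idea and may differ in detail from the paper's exact definition; the counts below are invented.

```python
def adjusted_confidence(n_ab, n_a, K):
    """Adjusted confidence of rule a -> b: #(a and b) / (#a + K).
    K = 0 recovers ordinary confidence; larger K shrinks the score of
    rules whose left-hand side occurs rarely."""
    return n_ab / (n_a + K)

# A rule seen once (1 of 1) vs. a well-supported rule (80 of 100):
print(adjusted_confidence(1, 1, 0), adjusted_confidence(80, 100, 0))
print(adjusted_confidence(1, 1, 5), adjusted_confidence(80, 100, 5))
```

With K = 0 the rare rule wins (confidence 1.0 vs. 0.8); with K = 5 the ranking flips, which is the weaker, softer alternative to a strict minimum-support threshold.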
https://academiccommons.columbia.edu/catalog/ac:173832
Emir, Birol; Amaratunga, Dhammika; Beltangady, Mohan; Cabrera, Javier; Freeman, Roy; Madigan, David B.; Nguyen, Ha H.; Whalen, Edward Patrick10.7916/D8PK0D8NWed, 28 Jun 2017 20:27:47 +0000Due to the ever-increasing complexity of scientific technologies and resulting data, consulting statisticians are becoming more involved in the design, conduct, and analysis of biomedical research. This requires extensive collaboration between the consulting statistician and nonstatisticians, such as researchers, clinicians, and corporate executives. Consequently, a successful consulting career is becoming ever more dependent on the statistician's ability to effectively communicate with nonstatisticians. This is especially true when more complex, nontraditional analytical methods are required. In this paper, we examine the collaboration between statisticians and nonstatisticians from three different professional perspectives. Integrating these perspectives, we discuss ways to help the consulting statistician generate productive dialogue with clients. Finally, we examine how universities can better prepare students for careers in statistical consulting by incorporating more communication-based elements into their curriculum and by offering students ample opportunities to collaborate with nonstatisticians. Overall, we designed this exercise to help the consulting statistician generate dialogue with clients that results in more productive collaborations and a more satisfying work experience.Statistics, Bioinformatics, Medicinebe2166, dm2418, hhn2108, ew2320StatisticsArticlesBayesian Hierarchical Rule Modeling for Predicting Medical Conditions
https://academiccommons.columbia.edu/catalog/ac:173882
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.10.7916/D8V69GP1Wed, 28 Jun 2017 20:26:51 +0000We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future medical conditions given the patient’s current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “condition 1 and condition 2 → condition 3”) from a large set of candidate rules. Because this method “borrows strength” using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of conditions is available.Mathematics, Statistics, Medicinethm2105, dm2418StatisticsArticlesA Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction
https://academiccommons.columbia.edu/catalog/ac:173838
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.10.7916/D89C6VJDWed, 28 Jun 2017 20:25:55 +0000In many healthcare settings, patients visit healthcare professionals periodically and report multiple medical conditions, or symptoms, at each encounter. We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future symptoms given the patient’s current and past history of reported symptoms. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “symptom 1 and symptom 2 → symptom 3”) from a large set of candidate rules. Because this method “borrows strength” using the symptoms of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of symptoms is available.Mathematics, Statistics, Medicinethm2105, dm2418StatisticsArticlesAlgorithms for Sparse Linear Classifiers in the Massive Data Setting
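As a concrete illustration of the rule-mining setting in the two HARM abstracts above, the sketch below enumerates candidate rules of the form "symptom a and symptom b → symptom c" from toy visit histories and scores them by naive confidence. This is only the candidate-generation step, under a simplified counting scheme of my own; the papers' Bayesian hierarchical model for borrowing strength across patients is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def rule_confidences(histories):
    """Enumerate candidate rules (a, b) -> c from patient histories, where each
    history is a list of visits and each visit is a set of reported symptoms.
    A rule fires when a and b have both appeared by some visit and c is
    reported at a later visit; confidence = rule count / pair count."""
    pair_count = Counter()
    rule_count = Counter()
    for visits in histories:
        seen, counted = set(), set()
        for i, visit in enumerate(visits):
            seen |= visit
            later = set().union(*visits[i + 1:])
            for a, b in combinations(sorted(seen), 2):
                if (a, b) in counted:
                    continue            # count each pair once per patient,
                counted.add((a, b))     # at the first visit where both appear
                pair_count[(a, b)] += 1
                for c in later - {a, b}:
                    rule_count[(a, b, c)] += 1
    return {r: n / pair_count[r[:2]] for r, n in rule_count.items()}

# three hypothetical patients, two visits each
histories = [
    [{"cough", "fever"}, {"fatigue"}],
    [{"cough", "fever"}, {"rash"}],
    [{"cough"}, {"fever"}],
]
conf = rule_confidences(histories)  # e.g. rule ("cough", "fever") -> "fatigue"
```

A hierarchical model would then shrink these raw confidences toward estimates pooled across similar patients, which is what lets HARM make predictions for patients with short histories.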
https://academiccommons.columbia.edu/catalog/ac:173908
Balakrishnan, Suhrid; Madigan, David B.; Bartlett, Peter10.7916/D8Z0368XWed, 28 Jun 2017 20:24:23 +0000Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.Statistics, Artificial intelligencedm2418StatisticsArticlesLocation Estimation in Wireless Networks: A Bayesian Approach
https://academiccommons.columbia.edu/catalog/ac:173820
Madigan, David B.; Ju, Wen-Hua; Krishnan, P.; Krishnakumar, A. S.; Zorych, Ivan10.7916/D82V2D74Wed, 28 Jun 2017 20:23:52 +0000We present a Bayesian hierarchical model for indoor location estimation in wireless networks. We demonstrate that our model achieves accuracy similar to that of other published models and algorithms. By harnessing prior knowledge, our model drastically reduces the requirement for training data as compared with existing approaches.Mathematics, Statisticsdm2418StatisticsArticlesA One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
https://academiccommons.columbia.edu/catalog/ac:173899
Balakrishnan, Suhrid; Madigan, David B.10.7916/D8B56GTPWed, 28 Jun 2017 20:23:50 +0000For Bayesian analysis of massive data, Markov chain Monte Carlo (MCMC) techniques often prove infeasible due to computational resource constraints. Standard MCMC methods generally require a complete scan of the dataset for each iteration. Ridgeway and Madigan (2002) and Chopin (2002b) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a small portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this paper, we present "1PFS," an efficient, one-pass algorithm. The algorithm employs a simple modification of the Ridgeway and Madigan (2002) particle filtering algorithm that replaces the MCMC based "rejuvenation" step with a more efficient "shrinkage" kernel smoothing based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway and Madigan (2002), namely a mixture model for Markov chains and Bayesian logistic regression. Our results indicate the proposed scheme delivers accurate parameter estimates while employing only a single pass through the data.Mathematics, Statisticsdm2418StatisticsArticlesAnalysis of Variance of Cross-Validation Estimators of the Generalization Error
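The "shrinkage" kernel-smoothing step that 1PFS substitutes for MCMC-based rejuvenation follows the general Liu–West idea: shrink particles toward their weighted mean and add matched kernel noise. A one-dimensional sketch (my own notation and function name, not the authors' code):

```python
import math
import random

def shrinkage_rejuvenate(particles, weights, a=0.98, seed=0):
    """One-dimensional kernel-shrinkage 'rejuvenation': shrink each particle
    toward the weighted mean, then add kernel noise scaled so the first two
    moments of the weighted particle cloud are approximately preserved."""
    rng = random.Random(seed)
    z = sum(weights)
    w = [wi / z for wi in weights]
    mean = sum(wi * x for wi, x in zip(w, particles))
    var = sum(wi * (x - mean) ** 2 for wi, x in zip(w, particles))
    h = math.sqrt(1.0 - a * a)  # bandwidth matched to the shrinkage factor
    return [a * x + (1.0 - a) * mean + h * math.sqrt(var) * rng.gauss(0.0, 1.0)
            for x in particles]
```

Shrinking by a and jittering with variance (1 - a^2)·var keeps the cloud's mean and variance stable, which is what makes the step a cheap stand-in for an MCMC move.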
https://academiccommons.columbia.edu/catalog/ac:173902
Markatou, Marianthi; Tian, Hong; Biswas, Shameek; Hripcsak, George M.10.7916/D86D5R2XWed, 28 Jun 2017 20:23:49 +0000This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables Y = Card(Sj ∩ Sj') and Y* = Card(Sj^c ∩ Sj'^c), where Sj, Sj' are two training sets, and Sj^c, Sj'^c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results to the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.Statistics, Artificial intelligencemm168, ht2031, spb2003, gh13Biostatistics, Biomedical Informatics, StatisticsArticles[Least Angle Regression]: Discussion
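The hypergeometric claim in the cross-validation abstract above is easy to check numerically: if two training sets of size m are drawn independently and uniformly from n examples, the overlap Y = Card(Sj ∩ Sj') is hypergeometric with n items, m "successes", and m draws. A stdlib-only sketch (function names are mine):

```python
import math
import random

def overlap_pmf(n, m, k):
    """P(|Sj ∩ Sj'| = k) for two independent uniform size-m subsets of n items:
    hypergeometric pmf C(m,k)·C(n-m, m-k) / C(n,m)."""
    return math.comb(m, k) * math.comb(n - m, m - k) / math.comb(n, m)

def simulate_overlap(n, m, reps=20000, seed=0):
    """Empirical distribution of the overlap size from repeated sampling."""
    rng = random.Random(seed)
    items = list(range(n))
    counts = [0] * (m + 1)
    for _ in range(reps):
        a = set(rng.sample(items, m))
        b = set(rng.sample(items, m))
        counts[len(a & b)] += 1
    return [c / reps for c in counts]

n, m = 20, 8
emp = simulate_overlap(n, m)
exact = [overlap_pmf(n, m, k) for k in range(m + 1)]
# empirical frequencies should track the hypergeometric pmf closely
```

(The example assumes n - m >= m so that all overlap sizes from 0 to m are possible.)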
https://academiccommons.columbia.edu/catalog/ac:173841
Madigan, David B.; Ridgeway, Greg10.7916/D81V5C29Wed, 28 Jun 2017 20:23:33 +0000Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms [e.g., Osborne, Presnell and Turlach (2000)]. Efron, Hastie, Johnstone and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such this paper is an important contribution to statistical computing.Mathematics, Statisticsdm2418StatisticsArticlesCorrection: Separation and completeness properties for AMP chain graph Markov models
https://academiccommons.columbia.edu/catalog/ac:173887
Madigan, David B.; Levitz, Michael; Perlman, Michael D.10.7916/D8QF8R05Wed, 28 Jun 2017 20:23:16 +0000Correction of table 2 on page 1757 of 'Separation and completeness properties for AMP chain graph Markov models', Annals of Statistics, volume 29 (2001).Mathematics, Statisticsdm2418StatisticsArticlesBook Reviews: Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth.
https://academiccommons.columbia.edu/catalog/ac:173915
Madigan, David B.10.7916/D8DZ06D8Wed, 28 Jun 2017 20:23:08 +0000"Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth. MIT Press, Cambridge, MA, 2001. $50.00. xxxii+546 pp., hardcover. ISBN 0-262-08290-X. Is data mining the same as statistics? The distinguished authors of Principles of Data Mining struggle to make a distinction between the two subjects. In the end, what they have written is a fine applied statistics text." -- page 501Statisticsdm2418StatisticsReviewsBayesian Model Averaging: a Tutorial (with Comments by M. Clyde, David Draper and E. I. George, and a Rejoinder by the Authors)
https://academiccommons.columbia.edu/catalog/ac:173853
Hoeting, Jennifer A.; Madigan, David B.; Raftery, Adrian E.; Volinsky, Chris T.; Clyde, M.; Draper, David; George, E. I.10.7916/D84M92N7Wed, 28 Jun 2017 20:23:00 +0000Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.Statisticsdm2418StatisticsArticlesSeparation and Completeness Properties for Amp Chain Graph Markov Models
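A minimal sketch of the BMA mechanism the tutorial above surveys, using the common BIC approximation to posterior model probabilities (equal prior model probabilities assumed; function names and numbers are illustrative, not from the paper):

```python
import math

def bma_weights(bics):
    """Approximate posterior model probabilities from BIC values:
    p(M_k | data) ∝ exp(-BIC_k / 2), under equal prior model probabilities."""
    best = min(bics)
    raw = [math.exp(-(b - best) / 2.0) for b in bics]  # shift for stability
    z = sum(raw)
    return [r / z for r in raw]

def bma_predict(preds, bics):
    """Model-averaged prediction: posterior-weighted mean of per-model
    predictions, rather than the prediction of a single selected model."""
    return sum(w * p for w, p in zip(bma_weights(bics), preds))

# three hypothetical candidate models, each with a BIC and a point prediction
bics = [100.0, 102.0, 110.0]
preds = [1.0, 1.4, 3.0]
w = bma_weights(bics)
yhat = bma_predict(preds, bics)
```

Because the weights decay exponentially in BIC, the averaged prediction leans on the well-supported models but still hedges against model uncertainty, which is the source of the improved out-of-sample performance the abstract describes.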
https://academiccommons.columbia.edu/catalog/ac:173847
Levitz, Michael; Perlman, Michael D.; Madigan, David B.10.7916/D8X34VJGWed, 28 Jun 2017 20:23:00 +0000Pearl’s well-known d-separation criterion for an acyclic directed graph (ADG) is a pathwise separation criterion that can be used to efficiently identify all valid conditional independence relations in the Markov model determined by the graph. This paper introduces p-separation, a pathwise separation criterion that efficiently identifies all valid conditional independences under the Andersson–Madigan–Perlman (AMP) alternative Markov property for chain graphs (= adicyclic graphs), which include both ADGs and undirected graphs as special cases. The equivalence of p-separation to the augmentation criterion occurring in the AMP global Markov property is established, and p-separation is applied to prove completeness of the global Markov property for AMP chain graph models. Strong completeness of the AMP Markov property is established, that is, the existence of Markov perfect distributions that satisfy those and only those conditional independences implied by the AMP property (equivalently, by p-separation). A linear-time algorithm for determining p-separation is presented.Mathematics, Statisticsdm2418StatisticsArticlesA Characterization of Markov Equivalence Classes for Acyclic Digraphs
https://academiccommons.columbia.edu/catalog/ac:173896
Andersson, Steen A.; Madigan, David B.; Perlman, Michael D.10.7916/D8FX77J3Wed, 28 Jun 2017 20:22:39 +0000Undirected graphs and acyclic digraphs (ADG's), as well as their mutual extension to chain graphs, are widely used to describe dependencies among variables in multivariate distributions. In particular, the likelihood functions of ADG models admit convenient recursive factorizations that often allow explicit maximum likelihood estimates and that are well suited to building Bayesian networks for expert systems. Whereas the undirected graph associated with a dependence model is uniquely determined, there may be many ADG's that determine the same dependence (i.e., Markov) model. Thus, the family of all ADG's with a given set of vertices is naturally partitioned into Markov-equivalence classes, each class being associated with a unique statistical model. Statistical procedures, such as model selection or model averaging, that fail to take into account these equivalence classes may incur substantial computational or other inefficiencies. Here it is shown that each Markov-equivalence class is uniquely determined by a single chain graph, the essential graph, that is itself simultaneously Markov equivalent to all ADG's in the equivalence class. Essential graphs are characterized, a polynomial-time algorithm for their construction is given, and their applications to model selection and other statistical questions are described.Mathematics, Statisticsdm2418StatisticsArticlesA Note on Equivalence Classes of Directed Acyclic Independence Graphs
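The Markov-equivalence classes characterized above can be tested with the well-known skeleton-plus-v-structure criterion of Verma and Pearl: two ADGs are Markov equivalent exactly when they share the same skeleton and the same v-structures. A sketch of that criterion (the paper's essential-graph construction itself is not reproduced here):

```python
from itertools import combinations

def skeleton(dag):
    """Undirected edge set of a DAG given as {node: set of parents}."""
    return {frozenset((u, v)) for v, ps in dag.items() for u in ps}

def v_structures(dag):
    """Colliders a -> c <- b with a and b non-adjacent ('immoralities')."""
    skel = skeleton(dag)
    vs = set()
    for c, ps in dag.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                vs.add((frozenset((a, b)), c))
    return vs

def markov_equivalent(g1, g2):
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# x -> y -> z versus x <- y <- z: same Markov model (no colliders)
chain1 = {"x": set(), "y": {"x"}, "z": {"y"}}
chain2 = {"x": {"y"}, "y": {"z"}, "z": set()}
# x -> y <- z: a collider, hence a different Markov model
collider = {"x": set(), "y": {"x", "z"}, "z": set()}
```

The essential graph described in the abstract is the canonical representative of such a class: it keeps the directions shared by all members and leaves the rest undirected.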
https://academiccommons.columbia.edu/catalog/ac:173826
Madigan, David B.10.7916/D8TB150CWed, 28 Jun 2017 20:22:29 +0000Directed acyclic independence graphs (DAIGs) play an important role in recent developments in probabilistic expert systems and influence diagrams (Chyu [1]). The purpose of this note is to show that DAIGs can usefully be grouped into equivalence classes where the members of a single class share identical Markov properties. These equivalence classes can be identified via a simple graphical criterion. This result is particularly relevant to model selection procedures for DAIGs (see, e.g., Cooper and Herskovits [2] and Madigan and Raftery [4]) because it reduces the problem of searching among possible orientations of a given graph to that of searching among the equivalence classes.Mathematics, Statisticsdm2418StatisticsArticles[Bayesian Analysis in Expert Systems]: Comment: What's Next?
https://academiccommons.columbia.edu/catalog/ac:173856
Madigan, David B.10.7916/D8W37TFJWed, 28 Jun 2017 20:22:27 +0000"These papers represent two of the many different graphical modeling camps that have emerged from a flurry of activity in the past decade. The paper by Cox and Wermuth falls within the statistical graphical modeling camp and provides a useful generalization of that body of work. There is, of course, a price to be paid for this generality, namely that the interpretation of the graphs is more complex...The paper by Spiegelhalter, Dawid, Lauritzen and Cowell falls within the probabilistic expert system camp. This is a tour de force by researchers responsible for much of the astonishing progress in this area. Ten years ago, probabilistic models were shunned by the artificial intelligence community. That they are now widely accepted and used is due in large measure to the insights and efforts of these authors, along with other pioneers such as Judea Pearl and Peter Cheeseman..." -- page 261Mathematics, Statisticsdm2418StatisticsArticlesReporting of analyses from randomized controlled trials with multiple arms: a systematic review
https://academiccommons.columbia.edu/catalog/ac:180137
Ravaud, Philippe; Perrodeau, Elodie; Boutron, Isabelle; Baron, Gabriel10.7916/D837772TWed, 28 Jun 2017 20:20:55 +0000Background: Multiple-arm randomized trials can be more complex in their design, data analysis, and result reporting than two-arm trials. We conducted a systematic review to assess the reporting of analyses in reports of randomized controlled trials (RCTs) with multiple arms. Methods: The literature in the MEDLINE database was searched for reports of RCTs with multiple arms published in 2009 in the core clinical journals. Two reviewers extracted data using a standardized extraction form. Results: In total, 298 reports were identified. Descriptions of the baseline characteristics and outcomes per group were missing in 45 reports (15.1%) and 48 reports (16.1%), respectively. More than half of the articles (n = 171, 57.4%) reported that a planned global test comparison was used (that is, assessment of the global differences between all groups), but 67 (39.2%) of these 171 articles did not report details of the planned analysis. Of the 116 articles reporting a global comparison test, 12 (10.3%) did not report the analysis as planned. In all, 60% of publications (n = 180) described planned pairwise test comparisons (that is, assessment of the difference between two groups), but 20 of these 180 articles (11.1%) did not report the pairwise test comparisons. Of the 204 articles reporting pairwise test comparisons, the comparisons were not planned for 44 (21.6%) of them. Less than half the reports (n = 137; 46%) provided baseline and outcome data per arm and reported the analysis as planned. Conclusions: Our findings highlight discrepancies between the planning and reporting of analyses in reports of multiple-arm trials.Statistics, Medical sciencespr2341EpidemiologyArticlesCopy number variation genotyping using family information
https://academiccommons.columbia.edu/catalog/ac:180080
Darvishi, Katayoon; Mills, Ryan E.; Lee, Charles; Raby, Benjamin A.; Chu, Jen-hwa; Rogers, Angela; Ionita-Laza, Iuliana10.7916/D8HD7T0DWed, 28 Jun 2017 20:20:16 +0000Background: In recent years there has been a growing interest in the role of copy number variations (CNV) in genetic diseases. Though there has been rapid development of technologies and statistical methods devoted to the detection of CNVs from array data, the inherent data-quality challenges associated with most hybridization techniques remain a problem in CNV association studies. Results: To help address these data quality issues in the context of family-based association studies, we introduce a statistical framework for the intensity-based array data that takes into account the family information for copy-number assignment. The method is an adaptation of traditional methods for modeling SNP genotype data that assume a Gaussian mixture model, whereby CNV calling is performed for all family members simultaneously, leveraging within-family data to reduce CNV calls that are incompatible with Mendelian inheritance while still allowing de novo CNVs. Applying this method to simulation studies and a genome-wide association study in asthma, we find that our approach significantly improves CNV call accuracy and reduces Mendelian inconsistency rates and false-positive genotype calls. The results were validated using qPCR experiments. Conclusions: We have demonstrated that the use of family information can improve the quality of CNV calling and give more powerful association tests of CNVs.Genetics, Statisticsii2135Mailman School of Public HealthArticlesHelping the decision maker effectively promote various experts’ views into various optimal solutions to China’s institutional problem of health care provider selection.
https://academiccommons.columbia.edu/catalog/ac:180132
Tang, Liyang10.7916/D8BP0147Wed, 28 Jun 2017 20:20:05 +0000Background: The main aim of China’s Health Care System Reform was to help the decision maker find the optimal solution to China’s institutional problem of health care provider selection. A pilot health care provider research system was recently organized in China’s health care system, and it could efficiently collect the data for determining the optimal solution to China’s institutional problem of health care provider selection from various experts. The purpose of this study was therefore to apply the optimal implementation methodology to help the decision maker effectively promote various experts’ views into various optimal solutions to this problem with the support of this pilot system. Methods: After the general framework of China’s institutional problem of health care provider selection was established, this study collaborated with the National Bureau of Statistics of China to commission a large-scale 2009 to 2010 national expert survey (n = 3,914) through the organization of a pilot health care provider research system for the first time in China, and the analytic network process (ANP) implementation methodology was adopted to analyze the dataset from this survey. 
Results: The market-oriented health care provider approach was the optimal solution to China’s institutional problem of health care provider selection from the doctors’ point of view; the traditional government’s regulation-oriented health care provider approach was the optimal solution to China’s institutional problem of health care provider selection from the pharmacists’ point of view, the hospital administrators’ point of view, and the point of view of health officials in health administration departments; the public private partnership (PPP) approach was the optimal solution to China’s institutional problem of health care provider selection from the nurses’ point of view, the point of view of officials in medical insurance agencies, and the health care researchers’ point of view. Conclusions: The data collected through a pilot health care provider research system in the 2009 to 2010 national expert survey could help the decision maker effectively promote various experts’ views into various optimal solutions to China’s institutional problem of health care provider selection.Business, StatisticsBusinessArticlesExpressionPlot: a web-based framework for analysis of RNA-Seq and microarray gene expression data
https://academiccommons.columbia.edu/catalog/ac:183139
Friedman, Brad; Maniatis, Tom10.7916/D82J6979Wed, 28 Jun 2017 20:18:15 +0000RNA-Seq and microarray platforms have emerged as important tools for detecting changes in gene expression and RNA processing in biological samples. We present ExpressionPlot, a software package consisting of a default back end, which prepares raw sequencing or Affymetrix microarray data, and a web-based front end, which offers a biologically centered interface to browse, visualize, and compare different data sets. Download and installation instructions, a user's manual, discussion group, and a prototype are available at http://expressionplot.comStatistics, Bioinformaticstm2472Biochemistry and Molecular BiophysicsArticlesProspect Theory as Efficient Perceptual Distortion
https://academiccommons.columbia.edu/catalog/ac:167407
Woodford, Michael10.7916/D8T43R03Wed, 28 Jun 2017 17:07:08 +0000The paper proposes a theory of efficient perceptual distortions, in which the statistical relation between subjective perceptions and the objective state minimizes the error of the state estimate, subject to a constraint on information processing capacity. The theory is shown to account for observed limits to the accuracy of visual perception, and then postulated to apply to perception of options in economic choice situations as well. When applied to choice between lotteries, it implies reference-dependent valuations, and predicts both risk-aversion with respect to gains and risk-seeking with respect to losses, as in the prospect theory of Kahneman and Tversky (1979).Statistics, Economics, Sociologymw2230EconomicsArticlesLearning to Believe in Sunspots
https://academiccommons.columbia.edu/catalog/ac:167710
Woodford, Michael10.7916/D85X26VBWed, 28 Jun 2017 16:57:54 +0000An adaptive learning rule is exhibited for the Azariadis (1981) overlapping generations model of a monetary economy with multiple equilibria, under which the economy may converge to a stationary sunspot equilibrium, even if agents do not initially believe that outcomes are significantly different in different "sunspot" states. The type of learning rule studied is of the "stochastic approximation" form studied by Robbins and Monro (1951);
methods for analyzing the convergence of this form of algorithm are presented that may be of use in many other contexts as well. Conditions are given under which convergence to a sunspot equilibrium occurs with probability one.Economics, Statisticsmw2230EconomicsArticlesThe Representation of Social Processes by Markov Models
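The Robbins and Monro (1951) stochastic-approximation form referenced above can be sketched on a toy root-finding problem (illustrative only; the names and the example are mine, and it has nothing to do with the sunspot model's specifics):

```python
import random

def robbins_monro(noisy_f, theta0, steps=5000, seed=0):
    """Robbins-Monro stochastic approximation: seek theta* with
    E[f(theta*)] = 0 via theta <- theta + a_n * f_noisy(theta), using
    a_n = 1/(n+1), which satisfies the classical step-size conditions
    (sum of a_n diverges, sum of a_n^2 converges)."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(steps):
        theta += (1.0 / (n + 1)) * noisy_f(theta, rng)
    return theta

# toy problem: E[f(theta)] = 2 - theta, observed with unit Gaussian noise
def noisy(theta, rng):
    return (2.0 - theta) + rng.gauss(0.0, 1.0)

est = robbins_monro(noisy, theta0=0.0)  # drifts toward the root theta* = 2
```

Adaptive learning rules of this form converge despite the noise because the decaying step sizes average the noise away while still allowing unbounded total movement.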
https://academiccommons.columbia.edu/catalog/ac:165054
Singer, Burton; Spilerman, Seymour10.7916/D80G3W8BWed, 28 Jun 2017 16:56:20 +0000In this paper we consider a class of issues which are central to modeling social phenomena by continuous-time Markov structures. In particular, we discuss (a) embeddability, or how to determine whether observations on an empirical process could have arisen via the evolution of a continuous-time Markov structure; and (b) identification, or what to do if the observations are consistent with more than one continuous-time Markov structure. With respect to the latter topic, we discuss how to select the specific structure from the list of alternatives which should be associated with the empirical process. We point out that the issues of embeddability and identification are especially pertinent to modeling empirical processes when one has available only fragmentary data and when the observations contain "noise" or other sources of error. These characteristics, of course, describe the typical work situation of sociologists. Finally, we note the type of situation in which a continuous-time model is the proper structure to employ and indicate that issues analogous to the ones we describe here apply to modeling social processes with discrete-time structures.Sociology, Statisticsss50SociologyArticlesOn the relationship between total ozone and atmospheric dynamics and chemistry at mid-latitudes – Part 1: Statistical models and spatial fingerprints of atmospheric dynamics and chemistry
https://academiccommons.columbia.edu/catalog/ac:161210
Frossard, L.; Rieder, Harald; Ribatet, M.; Staehelin, J.; Maeder, J. A.; Di Rocco, S.; Davison, A. C.; Peter, T.10.7916/D86M3HVNTue, 27 Jun 2017 20:08:50 +0000We use statistical models for mean and extreme values of total column ozone to analyze "fingerprints" of atmospheric dynamics and chemistry on long-term ozone changes at northern and southern mid-latitudes on a grid cell basis. At each grid cell, the r-largest order statistics method is used for the analysis of extreme events in low and high total ozone (termed ELOs and EHOs, respectively), and an autoregressive moving average (ARMA) model is used for the corresponding mean value analysis. In order to describe the dynamical and chemical state of the atmosphere, the statistical models include important atmospheric covariates: the solar cycle, the Quasi-Biennial Oscillation (QBO), ozone depleting substances (ODS) in terms of equivalent effective stratospheric chlorine (EESC), the North Atlantic Oscillation (NAO), the Antarctic Oscillation (AAO), the El Niño/Southern Oscillation (ENSO), and aerosol load after the volcanic eruptions of El Chichón and Mt. Pinatubo. The influence of the individual covariates on mean and extreme levels in total column ozone is derived on a grid cell basis. The results show that "fingerprints", i.e., significant influence, of dynamical and chemical features are captured in both the "bulk" and the tails of the statistical distribution of ozone, respectively described by mean values and EHOs/ELOs. While results for the solar cycle, QBO, and EESC are in good agreement with findings of earlier studies, unprecedented spatial fingerprints are retrieved for the dynamical covariates. Column ozone is enhanced over Labrador/Greenland, the North Atlantic sector and over the Norwegian Sea, but is reduced over Europe, Russia and the Eastern United States during the positive NAO phase, and vice versa during the negative phase. 
The NAO's southern counterpart, the AAO, strongly influences column ozone at lower southern mid-latitudes, including the southern parts of South America and the Antarctic Peninsula, and the central southern mid-latitudes. Results for both NAO and AAO confirm the importance of atmospheric dynamics for ozone variability and changes from local/regional to global scales.Statistics, Atmospheric chemistry, Atmospherehr2302Lamont-Doherty Earth ObservatoryArticlesAnalyzing Postdisaster Surveillance Data: The Effect of the Statistical Method
https://academiccommons.columbia.edu/catalog/ac:157474
DiMaggio, Charles J.; Galea, Sandro; Abramson, David M.10.7916/D8G4513DTue, 27 Jun 2017 17:52:15 +0000Data from existing administrative databases and ongoing surveys or surveillance methods may prove indispensable after mass traumas as a way of providing information that may be useful to emergency planners and practitioners. The analytic approach, however, may affect exposure prevalence estimates and measures of association. We compare Bayesian hierarchical modeling methods to standard survey analytic techniques for survey data collected in the aftermath of a terrorist attack. Estimates for the prevalence of exposure to the terrorist attacks of September 11, 2001, varied by the method chosen. Bayesian hierarchical modeling returned the lowest estimate for exposure prevalence with a credible interval spanning nearly 3 times the range of the confidence intervals (CIs) associated with both unadjusted and survey procedures. Bayesian hierarchical modeling also returned a smaller point estimate for measures of association, although in this instance the credible interval was tighter than that obtained through survey procedures. Bayesian approaches allow a consideration of preexisting assumptions about survey data, and may offer potential advantages, particularly in the uncertain environment of postterrorism and disaster settings. Additional comparative analyses of existing data are necessary to guide our ability to use these techniques in future incidents.Emergency management, Statisticscjd11, sg822, dma3National Center for Disaster PreparednessArticlesR2WinBUGS: A Package for Running WinBUGS from R
https://academiccommons.columbia.edu/catalog/ac:154734
Sturtz, Sibylle; Ligges, Uwe; Gelman, Andrew E.10.7916/D80C55HHTue, 27 Jun 2017 15:43:29 +0000The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has finished, it is possible either to read the resulting data into R by the package itself—which gives a compact graphical summary of inference and convergence diagnostics—or to use the facilities of the coda package for further analyses of the output. Examples are given to demonstrate the usage of this package.Statisticsag389StatisticsArticlesMultiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box
https://academiccommons.columbia.edu/catalog/ac:154731
Su, Yu-Sung; Gelman, Andrew E.; Hill, Jennifer; Yajima, Masanao10.7916/D8VQ3CD3Tue, 27 Jun 2017 15:43:28 +0000Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.Statisticsag389StatisticsArticlesBayesian Statistical Pragmatism
https://academiccommons.columbia.edu/catalog/ac:154737
Gelman, Andrew E.10.7916/D8MC98QJTue, 27 Jun 2017 15:39:20 +0000I agree with Rob Kass’ point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.Statisticsag389StatisticsArticlesSegregation in Social Networks Based on Acquaintanceship and Trust
https://academiccommons.columbia.edu/catalog/ac:154740
DiPrete, Thomas A.; Gelman, Andrew E.; McCormick, Tyler; Teitler, Julien O.; Zheng, Tian10.7916/D8F198DHTue, 27 Jun 2017 15:38:27 +0000Using 2006 General Social Survey data, the authors compare levels of segregation by race and along other dimensions of potential social cleavage in the contemporary United States. Americans are not as isolated as the most extreme recent estimates suggest. However, hopes that “bridging” social capital is more common in broader acquaintanceship networks than in core networks are not supported. Instead, the entire acquaintanceship network is perceived by Americans to be about as segregated as the much smaller network of close ties. People do not always know the religiosity, political ideology, family behaviors, or socioeconomic status of their acquaintances, but perceived social divisions on these dimensions are high, sometimes rivaling racial segregation in acquaintanceship networks. The major challenge to social integration today comes from the tendency of many Americans to isolate themselves from others who differ on race, political ideology, level of religiosity, and other salient aspects of social identity.Statisticstad61, ag389, thm2105, jot8, tz33StatisticsArticlesEditorial: Special Section on Statistical and Perceptual Audio Processing
https://academiccommons.columbia.edu/catalog/ac:144493
Ellis, Daniel P. W.; Raj, Bhiksha; Brown, Judith C.; Slaney, Malcolm; Smaragdis, Paris10.7916/D83T9SVCTue, 27 Jun 2017 14:12:37 +0000Human perception has always been an inspiration for automatic processing systems, not least because tasks such as speech recognition only exist because people do them—and, indeed, without that example we might wonder if they were possible at all. As computational power grows, we have increasing opportunities to model and duplicate perceptual abilities with greater fidelity, and, most importantly, based on larger and larger amounts of raw data describing both what signals exist in the real world, and how people respond to them. The power to deal with large data sets has meant that approaches that were once mere theoretical possibilities, such as exhaustive search of exponentially-sized codebooks, or real-time direct convolution of long sequences, have become increasingly practical and even unremarkable. A major consequence of this is the growth of statistical or corpus-based approaches, where complex relations, discriminations, or structures are inferred directly from example data (for instance by optimizing the parameters of a very general algorithm). An increasing number of complex tasks can be given empirically optimal solutions based on large, representative datasets. The traditional idea of perceptually-inspired processing is to develop a machine algorithm for a complex task such as melody recognition or source separation through inspiration and introspection about how individuals perform the task, and on the basis of direct psychological or neurophysiological data. The results can appear to be at odds with the statistical perspective, since perceptually-motivated work is often ad-hoc, comprising many stages whose individual contributions are difficult to separate. 
We believe that it is important to unify these two approaches: to employ rigorous, exhaustive techniques taking advantage of the statistics of large data sets to develop and solve perceptually-based and subjectively-defined problems. With this in mind, we organized a one-day workshop on Statistical and Perceptual Audio Processing as a satellite to the International Conference on Spoken Language Processing (ICSLP-INTERSPEECH), held in Jeju, Korea, in September 2004.Statistics, Psychophysiologyde171Electrical EngineeringArticlesMultiscale Representations for Manifold-Valued Data
https://academiccommons.columbia.edu/catalog/ac:140178
Rahman, Inam Ur; Drori, Iddo; Stodden, Victoria C.; Donoho, David L.; Schroeder, Peter10.7916/D87371F4Mon, 26 Jun 2017 21:43:59 +0000We describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as the sphere S^2, the special orthogonal group SO(3), the positive definite matrices SPD(n), and the Grassmann manifolds G(n, k). The representations are based on the deployment of Deslauriers-Dubuc and Average Interpolating pyramids "in the tangent plane" of such manifolds, using the Exp and Log maps of those manifolds. The representations provide "wavelet coefficients" which can be thresholded, quantized, and scaled much as traditional wavelet coefficients. Tasks such as compression, noise removal, contrast enhancement, and stochastic simulation are facilitated by this representation. The approach applies to general manifolds, but is particularly suited to the manifolds we consider, i.e., Riemannian symmetric spaces such as S^(n-1), SO(n), and G(n, k), where the Exp and Log maps are effectively computable. Applications to manifold-valued data sources of a geometric nature (motion, orientation, diffusion) seem particularly immediate. A software toolbox, SymmLab, can reproduce the results discussed in this paper.Statisticsvcs2115StatisticsArticlesA Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization
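The pyramid constructions above rely on computable Exp and Log maps. For the sphere these are closed-form; below is a minimal numpy sketch (not the SymmLab toolbox) of the two maps at a base point, together with their round-trip consistency:

```python
import numpy as np

def exp_map(p, v):
    """Exponential map on the unit sphere at base point p (tangent v)."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * (v / norm)

def log_map(p, q):
    """Log map on the unit sphere: the tangent vector at p pointing to q."""
    cos_t = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    u = q - cos_t * p
    return theta * u / np.linalg.norm(u)

p = np.array([0.0, 0.0, 1.0])            # base point: north pole
v = np.array([0.3, -0.2, 0.0])           # tangent vector at p
q = exp_map(p, v)                        # point on the sphere
v_back = log_map(p, q)                   # recovers v since |v| < pi
```

The wavelet-style coefficients are then formed from differences of Log-mapped values in the tangent plane, which is what lets the usual thresholding and quantization carry over.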
https://academiccommons.columbia.edu/catalog/ac:173817
Eyheramendy, Susana; Madigan, David B.10.7916/D86M34ZFMon, 26 Jun 2017 20:40:43 +0000We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method to text classification and to simulated data and show that it outperforms the logistic and probit models, as well as the elastic net, generally by a substantial margin.Mathematics, Statisticsdm2418StatisticsChapters (layout features)Mathematical Representations of Development Theories
https://academiccommons.columbia.edu/catalog/ac:168029
Singer, Burton; Spilerman, Seymour10.7916/D8NP22DSMon, 26 Jun 2017 20:39:04 +0000In this chapter we explore the consequences of particular stage linkage structures for the evolution of a population. We first argue the importance of constructing dynamic models of development theories and show the implications of various stage connections for population movements. A second focus concerns inverse problems: How the stage linkage structure may be recovered from survey data of the kind collected by developmental psychologists.Developmental psychology, Statisticsss50SociologyChapters (layout features)When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
https://academiccommons.columbia.edu/catalog/ac:140175
Donoho, David L.; Stodden, Victoria C.10.7916/D88D05N7Mon, 26 Jun 2017 20:25:26 +0000We interpret non-negative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of "parts" and NMF correctly identifies the "parts". We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases.Statisticsvcs2115StatisticsArticlesFast l1 Minimization for Genomewide Analysis of mRNA Lengths
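"Published NMF code" in this setting typically means alternating multiplicative updates in the style of Lee and Seung; the following is a self-contained numpy sketch on synthetic nonnegative data (invented here, not one of the paper's image articulation databases):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic parts-based data: V = W_true @ H_true with nonnegative factors.
n, m, k = 30, 40, 3
V = rng.random((n, k)) @ rng.random((k, m))

# Multiplicative updates for || V - W H ||_F; eps guards against division by 0.
eps = 1e-9
W = rng.random((n, k)) + 0.1
H = rng.random((k, m)) + 0.1
err_start = np.linalg.norm(V - W @ H)
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_end = np.linalg.norm(V - W @ H)
```

The updates keep both factors nonnegative by construction, so every iterate stays inside the simplicial-cone geometry of the paper; whether the recovered cone matches the generative one is precisely the uniqueness question the paper addresses.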
https://academiccommons.columbia.edu/catalog/ac:140172
Drori, Iddo; Stodden, Victoria C.; Hurowitz, Evan H.10.7916/D80V8P4RMon, 26 Jun 2017 20:25:25 +0000Application of the virtual northern method to human mRNA allows us to systematically measure transcript length on a genome-wide scale [1]. Characterization of RNA transcripts by length provides a measurement which complements cDNA sequencing. We have robustly extracted the lengths of the transcripts expressed by each gene for comparison with the Unigene, Refseq, and H-Invitational databases [2, 3]. Obtaining an accurate probability for each peak requires performing multiple bootstrap simulations, each involving a deconvolution operation which is equivalent to finding the sparsest non-negative solution of an underdetermined system of equations. This process is computationally intensive for a large number of simulations and genes. In this contribution we present an efficient approximation method which is faster than general purpose solvers by two orders of magnitude, and in practice reduces our processing time from a week to hours.Genetics, Statisticsvcs2115StatisticsArticlesBreakdown Point of Model Selection When the Number of Variables Exceeds the Number of Observations
https://academiccommons.columbia.edu/catalog/ac:140168
Donoho, David L.; Stodden, Victoria C.10.7916/D84M9DXZMon, 26 Jun 2017 20:25:24 +0000The classical multivariate linear regression problem assumes p variables X1, X2, ..., Xp and a response vector y, each with n observations, and a linear relationship between the two: y = Xβ + z, where z ~ N(0, σ²). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p ≫ n. We find that 1) the breakdown point is well-defined for random X-models and low noise, 2) increasing noise shifts the breakdown point to lower levels of sparsity, and reduces the model recovery ability of the algorithm in a systematic way, and 3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.Statisticsvcs2115StatisticsArticlesHigher-order Properties of Approximate Estimators
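The p > n setup is easy to reproduce. Below is a hedged numpy sketch of forward stepwise selection in the low-noise limit (z = 0); the dimensions and coefficient values are illustrative choices, not the paper's experimental grid:

```python
import numpy as np

rng = np.random.default_rng(2)

# y = X beta + z with p > n and a k-sparse beta; here z = 0 (low noise).
n, p, k = 100, 300, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta[support] = 3.0
y = X @ beta

# Forward stepwise: greedily add the column most correlated with the
# current residual, then refit by least squares on the selected set.
selected = []
resid = y.copy()
for _ in range(k):
    j = int(np.argmax(np.abs(X.T @ resid)))
    selected.append(j)
    Xs = X[:, selected]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef

# Below the breakdown point (sparsity k small relative to n), the true
# support is typically recovered exactly and the residual shrinks to ~0.
```

Raising the noise level or the sparsity k pushes the run past the breakdown point the abstract describes, at which the selected set stops matching the true support.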
https://academiccommons.columbia.edu/catalog/ac:199766
Kristensen, Dennis; Salanie, Bernard10.7916/D8KK9BVXWed, 21 Jun 2017 14:27:44 +0000Many modern estimation methods in econometrics approximate an objective function, for instance, through simulation or discretization. These approximations typically affect both bias and variance of the resulting estimator. We first provide a higher-order expansion of such “approximate” estimators that takes into account the errors due to the use of approximations. We show how a Newton-Raphson adjustment can reduce the impact of approximations. Then we use our expansions to develop inferential tools that take into account approximation errors: we propose adjustments of the approximate estimator that remove its first-order bias and adjust its standard errors. These corrections apply to a class of approximate estimators that includes all known simulation-based procedures. A Monte Carlo simulation on the mixed logit model shows that our proposed adjustments can yield significant improvements at a low computational cost.Econometrics, Estimation theory, Statistics, Economics, Mathematics, Computer sciencedk2313, bs2237EconomicsReportsSemi-convergence of an Iterative Algorithm
https://academiccommons.columbia.edu/catalog/ac:194857
Vasilaky, Kathryn N.10.7916/D8SJ1KFXWed, 21 Jun 2017 14:26:51 +0000An iterative method is introduced for solving noisy, ill-conditioned inverse problems. Analysis of the semi-convergence behavior identifies three error components: iteration error, noise error, and initial guess error. A derived expression explains how the three errors are related to each other as a function of the number of iterations. The standard Tikhonov regularization method is just the first iteration of the iterative method, and the derived noise-damping filter is a generalization of the standard Tikhonov filter. The derived filter is a function of two parameters: a regularization parameter and the iteration number. The new method is tested on a simulated data set for image reconstruction from projections.Iterative methods (Mathematics), Filters (Mathematics), Inverse problems (Differential equations), Statistics, Mathematicsknv4Earth InstituteReportsHierarchical Bayes models for daily rainfall time series at multiple locations from heterogenous data sources
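The abstract's claim that standard Tikhonov regularization is exactly the first iterate is easy to verify numerically. A minimal numpy sketch follows, with an illustrative ill-conditioned symmetric operator (the construction of A is an assumption for demonstration, not the report's test problem):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative ill-conditioned problem: A has a rapidly decaying spectrum.
n = 20
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = U @ np.diag(0.9 ** np.arange(n)) @ U.T
x_true = rng.normal(size=n)
b = A @ x_true + 0.01 * rng.normal(size=n)   # noisy data

lam = 1e-2
M = A.T @ A + lam * np.eye(n)

def iterated_tikhonov(num_iters):
    """x_{j+1} = x_j + (A'A + lam I)^{-1} A'(b - A x_j), starting at x_0 = 0."""
    x = np.zeros(n)
    for _ in range(num_iters):
        x = x + np.linalg.solve(M, A.T @ (b - A @ x))
    return x

x_tikhonov = np.linalg.solve(M, A.T @ b)      # standard Tikhonov solution
x_one = iterated_tikhonov(1)                  # first iterate: identical
x_ten = iterated_tikhonov(10)                 # later iterates fit the data more
```

Each extra iteration relaxes the damping filter and fits the noisy data more closely, which is why the iteration count acts as a second regularization parameter alongside lam; stopping too late lets the noise error dominate, the semi-convergence behavior the report analyzes.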
https://academiccommons.columbia.edu/catalog/ac:199595
Shirley, Kenneth; Vasilaky, Kathryn N.; Greatrex, Helen L.; Osgood, Daniel E.10.7916/D8QF8SZ4Wed, 21 Jun 2017 14:26:48 +0000We estimate a hierarchical Bayesian model for daily rainfall that incorporates two novelties for estimating spatial and temporal correlations. We estimate the within-site time series correlations for a particular rainfall site using multiple data sources at a given location, and we estimate the across-site covariance in rainfall based on the distance between locations. Previous rainfall models have captured cross-site correlations as a function of site-specific distances, but not within-site correlations across multiple data sources, and not both aspects simultaneously. Further, we incorporate information on the technology used (satellite versus rain gauge) in our estimation, which is also a novel addition. This methodology has far-reaching applications in providing more accurate and complex weather insurance contracts by combining information from multiple data sources at a single site, a crucial improvement in the face of climate change. The modeling also extends to many other data contexts where multiple data sources exist for a given event or variable and where both within- and between-series covariances can be estimated over time.Rain and rainfall--Forecasting, Computer simulation, Rain and rainfall--Mathematical models, Statistics, Mathematics, Meteorologyknv4, hlg2124, do2126Earth Institute, International Research Institute for Climate and SocietyReportsHigher-order Properties of Approximate Estimators
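The distance-based across-site covariance can be sketched with a simple stationary kernel; the exponential form and the site coordinates below are illustrative assumptions, not necessarily the specification used in the report:

```python
import numpy as np

# Hypothetical site coordinates, e.g. in kilometers.
sites = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 25.0], [40.0, 5.0]])

# Across-site covariance decaying with distance:
# cov(i, j) = sigma2 * exp(-d_ij / rho).
sigma2, rho = 1.0, 15.0
d = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
K = sigma2 * np.exp(-d / rho)

# The exponential kernel is positive definite for distinct sites, so K is
# a valid covariance for a Gaussian prior over site-level rainfall effects.
min_eig = np.linalg.eigvalsh(K).min()
```

In the full hierarchical model, a within-site covariance block (one row per data source, e.g. gauge and satellite) would be layered on top of this cross-site structure.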
https://academiccommons.columbia.edu/catalog/ac:188409
Kristensen, Dennis; Salanie, Bernard10.7916/D89886BKWed, 21 Jun 2017 13:58:02 +0000Many modern estimation methods in econometrics approximate an objective function, for instance, through simulation or discretization. These approximations typically affect both bias and variance of the resulting estimator. We first provide a higher-order expansion of such "approximate" estimators that takes into account the errors due to the use of approximations. We show how a Newton-Raphson adjustment can reduce the impact of approximations. Then we use our expansions to develop inferential tools that take into account approximation errors: we propose adjustments of the approximate estimator that remove its first-order bias and adjust its standard errors. These corrections apply to a class of approximate estimators that includes all known simulation-based procedures. A Monte Carlo simulation on the mixed logit model shows that our proposed adjustments can yield spectacular improvements at a low computational cost.Statistics, Economics, Mathematics, Computer sciencedk2313, bs2237EconomicsReports