Academic Commons Search Results
https://academiccommons.columbia.edu/catalog?action=index&controller=catalog&f%5Bdepartment_facet%5D%5B%5D=Statistics&format=rss&fq%5B%5D=has_model_ssim%3A%22info%3Afedora%2Fldpd%3AContentAggregator%22&q=&rows=500&sort=record_creation_date+desc
Stochastic Differential Equations and Strict Local Martingales
https://academiccommons.columbia.edu/catalog/ac:jsxksn02x6
Qiu, Lisha | 10.7916/D8F4911Q | Wed, 20 Dec 2017 17:13:14 +0000

In this thesis, we address two problems arising from the application of stochastic differential equations (SDEs). The first pertains to the detection of asset bubbles, where the price process solves an SDE. We combine the strict local martingale model with a statistical tool to instantaneously check the existence and severity of asset bubbles through the asset’s historical price process. Our approach assumes that the price process of interest is a CEV process. We relate the exponent parameter of the CEV process to an asset bubble by studying the future expectation and the running maximum of the CEV process. The detection of asset bubbles then boils down to the estimation of the exponent. With a dynamic linear regression model, inference on the exponent can be carried out from historical price data. Estimation of the volatility and calibration of the parameters of the dynamic linear regression model are also studied. When using SDEs in practice, for example in the detection of asset bubbles, one often simulates their paths with the Euler scheme to study the behavior of the solution. The second part of this thesis focuses on the convergence of the Euler scheme under the assumptions that the coefficients of the SDE are locally Lipschitz and that the solution has no finite explosion. We prove that if a numerical scheme converges uniformly on compact time sets (UCP) in probability with a certain rate under the globally Lipschitz condition, then UCP convergence with the same rate holds when the globally Lipschitz condition is replaced with a locally Lipschitz one plus a no-finite-explosion condition. One contribution of this thesis is a proof of √n-weak convergence of the asymptotic normalized error process; the limit error process is also provided.
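The simulation step described above can be sketched with a minimal Euler–Maruyama scheme for a driftless CEV process dS = σ S^α dW; the function name and parameter values below are illustrative, and the thesis's exact model specification may differ.

```python
import numpy as np

def euler_cev(s0, sigma, alpha, T, n, rng):
    """Euler-Maruyama path of the driftless CEV process dS = sigma * S**alpha dW."""
    dt = T / n
    s = np.empty(n + 1)
    s[0] = s0
    for i in range(n):
        dw = rng.normal(0.0, np.sqrt(dt))
        # keep the path nonnegative, as a price process should be
        s[i + 1] = max(s[i] + sigma * s[i] ** alpha * dw, 0.0)
    return s

rng = np.random.default_rng(0)
path = euler_cev(s0=1.0, sigma=0.2, alpha=1.5, T=1.0, n=1000, rng=rng)
```

For α > 1 the CEV process is a strict local martingale, which is precisely the bubble regime the thesis studies.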
We further study the boundedness of the second moment of the weak limit process and its running maximum under both the globally Lipschitz and the locally Lipschitz conditions. The convergence of the Euler scheme in the sense of approximating expectations of functionals is also studied under the locally Lipschitz condition.

Subjects: Statistics, Mathematics, Stochastic differential equations, Martingales (Mathematics), Convergence | lq2141 | Statistics | Theses

Topics in Computational Bayesian Statistics With Applications to Hierarchical Models in Astronomy and Sociology
https://academiccommons.columbia.edu/catalog/ac:8kprr4xgzr
Sahai, Swupnil | 10.7916/D83R15HQ | Thu, 09 Nov 2017 23:12:54 +0000

This thesis includes three parts. The overarching theme is how to analyze structured hierarchical data, with applications to astronomy and sociology. The first part discusses how expectation propagation can be used to parallelize the computation when fitting big hierarchical Bayesian models. This methodology is then used to fit a novel nonlinear mixture model to ultraviolet radiation from various regions of the observable universe. The second part discusses how the Stan probabilistic programming language can be used to numerically integrate terms in a hierarchical Bayesian model. This technique is demonstrated on supernova data and significantly speeds up convergence to the posterior distribution compared to a previous study that used a Gibbs-type sampler. The third part builds a formal latent kernel representation for aggregate relational data as a way to more robustly estimate the mixing characteristics of agents in a network. In particular, the framework is applied to sociology surveys to estimate, as a function of ego age, the age and sex composition of the personal networks of individuals in the United States.

Subjects: Statistics, Astronomy, Sociology, Bayesian statistical decision theory, Multilevel models (Statistics) | sks2196 | Statistics | Theses

Expansion of a filtration with a stochastic process: a high frequency trading perspective
https://academiccommons.columbia.edu/catalog/ac:tb2rbnzs9t
Neufcourt, Léo | 10.7916/D8571QKP | Fri, 13 Oct 2017 22:18:55 +0000

A theory of expansion of filtrations has been developed since the 1970s to model dynamic probabilistic problems with asymmetric information. It has found a special echo in mathematical finance around the concept of insider trading, which has in turn proved very convenient for expressing the abstract properties of augmentations of filtrations. Research has historically focused on two particular classes of expansions, initial and progressive expansions, corresponding to additional information generated respectively by a random variable and by a random time. Although they can reproduce some stylized facts in the insider trading paradigm, those two types of expansions are too restrictive to model quantitatively dynamic phenomena of contemporary interest, such as high-frequency trading. In order to model such a continuous flow of information, Kchia and Protter (2015) introduce augmentations of filtrations where the additional information is generated by a stochastic process.
This thesis complements the pioneering work of Kchia and Protter (2015) with an analysis of the information drift appearing in the transformation of semimartingales, which leads to a quantitative valuation of the additional information. In the preliminary chapters we introduce the general framework of expansions of filtrations and present the information drift as a key proxy for the value of information, characterizing its existence as a no-arbitrage condition and expressing the value increase of optimization problems associated with additional information as one of its integrals. The theoretical core of this thesis is formed by two series of convergence theorems for semimartingales and their information drifts under a new topology on filtrations, from which we derive the transformation of semimartingales when the filtration is augmented with a stochastic process, as well as a computational method to estimate the information drift. We finally study several dynamical examples of anticipative expansions of a Brownian filtration with stochastic processes, where the information drift does or does not exist, and set the foundations for an ongoing application to estimating the advantage of high-frequency traders over the general market.

Subjects: Statistics, Mathematics, Finance | ln2294 | Statistics | Theses

Essays on Matching and Weighting for Causal Inference in Observational Studies
https://academiccommons.columbia.edu/catalog/ac:k3j9kd51dt
Resa Juárez, María de los Angeles | 10.7916/D8959W4H | Fri, 13 Oct 2017 22:16:12 +0000

This thesis consists of three papers on matching and weighting methods for causal inference. The first paper conducts a Monte Carlo simulation study to evaluate the performance of multivariate matching methods that select a subset of treatment and control observations. The matching methods studied are the widely used nearest neighbor matching with propensity score calipers and the more recently proposed methods of optimal matching of an optimally chosen subset and optimal cardinality matching. The main findings are: (i) covariate balance, as measured by differences in means, variance ratios, Kolmogorov-Smirnov distances, and cross-match test statistics, is better with cardinality matching, since by construction it satisfies balance requirements; (ii) for given levels of covariate balance, the matched samples are larger with cardinality matching than with the other methods; (iii) in terms of covariate distances, optimal subset matching performs best; (iv) treatment effect estimates from cardinality matching have lower RMSEs, provided strong balance requirements are imposed, specifically fine balance or strength-k balance plus close mean balance. In standard practice, a matched sample is considered balanced if the absolute differences in means of the covariates across treatment groups are smaller than 0.1 standard deviations. However, the simulation results suggest that stronger forms of balance should be pursued in order to remove systematic biases due to observed covariates when a difference-in-means treatment effect estimator is used. In particular, if the true outcome model is additive then marginal distributions should be balanced, and if the true outcome model is additive with interactions then low-dimensional joint distributions should be balanced.
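The 0.1-standard-deviation rule of thumb mentioned above is straightforward to compute; here is a minimal sketch of the standard balance diagnostic (function and variable names are illustrative, not from the thesis):

```python
import numpy as np

def std_mean_diff(X_treated, X_control):
    """Absolute standardized differences in covariate means, the usual
    balance diagnostic: values below 0.1 are conventionally called balanced."""
    pooled_sd = np.sqrt((X_treated.var(axis=0, ddof=1)
                         + X_control.var(axis=0, ddof=1)) / 2.0)
    return np.abs(X_treated.mean(axis=0) - X_control.mean(axis=0)) / pooled_sd

rng = np.random.default_rng(0)
X_treated = rng.normal(0.5, 1.0, size=(100, 3))  # treated group shifted by 0.5 sd
X_control = rng.normal(0.0, 1.0, size=(100, 3))
smd = std_mean_diff(X_treated, X_control)
```

The paper's point is that passing this marginal check is necessary but not sufficient: fine balance or strength-k balance targets joint distributions as well.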
The second paper focuses on longitudinal studies, where marginal structural models (MSMs) are widely used to estimate the effect of time-dependent treatments in the presence of time-dependent confounders. Under a sequential ignorability assumption, MSMs yield unbiased treatment effect estimates by weighting each observation by the inverse of the probability of its observed treatment sequence given its history of observed covariates. However, these probabilities are typically estimated by fitting a propensity score model, and the resulting weights can fail to adjust for observed covariates due to model misspecification. Also, these weights tend to yield very unstable estimates if the predicted probabilities of treatment are very close to zero, which is often the case in practice. To address both of these problems, instead of modeling the probabilities of treatment, a design-based approach is taken and weights of minimum variance that adjust for the covariates across all possible treatment histories are found directly. For this, the role of weighting in longitudinal studies of treatment effects is analyzed, and a convex optimization problem that can be solved efficiently is defined. Unlike standard methods, this approach makes evident to the investigator the limitations imposed by the data when estimating causal effects without extrapolating. A simulation study shows that this approach outperforms standard methods, providing less biased and more precise estimates of time-varying treatment effects in a variety of settings. The proposed method is used on Chilean educational data to estimate the cumulative effect of attending a private subsidized school, as opposed to a public school, on students’ university admission test scores.
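The idea of solving directly for minimum-variance weights subject to covariate-balance constraints can be sketched in a simple cross-sectional form (the thesis works with treatment histories in longitudinal data; the names and the SLSQP solver choice here are our illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def balancing_weights(X_treated, X_control):
    """Minimum-variance nonnegative weights on the controls whose weighted
    covariate means match the treated means: a design-based alternative to
    inverse-probability weights from a fitted propensity model."""
    n = len(X_control)
    target = X_treated.mean(axis=0)
    constraints = [
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},              # weights sum to one
        {"type": "eq", "fun": lambda w: X_control.T @ w - target},   # exact mean balance
    ]
    res = minimize(lambda w: w @ w,              # minimize weight variability
                   np.full(n, 1.0 / n),          # start from uniform weights
                   bounds=[(0.0, None)] * n,
                   constraints=constraints,
                   method="SLSQP")
    return res.x

rng = np.random.default_rng(0)
X_treated = rng.normal(0.5, 1.0, size=(40, 2))
X_control = rng.normal(0.0, 1.0, size=(80, 2))
w = balancing_weights(X_treated, X_control)
```

If the program is infeasible, the data cannot balance the covariates without extrapolation, which is exactly the diagnostic transparency the paper emphasizes.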
The third paper is centered on observational studies with multi-valued treatments. Generalizing matching and stratification methods to accommodate multi-valued treatments has proven to be a complex task. A natural way to address confounding in this case is by weighting the observations, typically by the inverse probability of treatment weights (IPTW). As in the MSM case, these weights can be highly variable and produce unstable estimates due to extreme weights. In addition, model misspecification, small sample sizes, and truncation of extreme weights can cause the weights to fail to adjust appropriately for observed confounders. The conditions the weights need to satisfy in order to provide close-to-unbiased treatment effect estimates with reduced variability are determined, and the convex optimization problem that can be solved in polynomial time to obtain them is defined. A simulation study with different settings is conducted to compare the proposed weighting scheme to IPTW, including generalized propensity score estimation methods that also explicitly consider the covariate balance problem in the probability estimation process. The applicability of the methods to continuous treatments is also tested. The results show that directly targeting balance with the weights, instead of focusing on estimating treatment assignment probabilities, provides the best results in terms of bias and root mean square error of the treatment effect estimator. The effects of the intensity level of the 2010 Chilean earthquake on posttraumatic stress disorder are estimated using the proposed methodology.

Subjects: Statistics, Inference, Statistical matching, Probabilities | mdr2146 | Statistics | Theses

Efficient Estimation of the Expectation of a Latent Variable in the Presence of Subject-Specific Ancillaries
https://academiccommons.columbia.edu/catalog/ac:cz8w9ghx4p
Mittel, Louis Buchalter | 10.7916/D8JW8SFB | Fri, 13 Oct 2017 16:18:26 +0000

Latent variables are often included in a model in order to capture the diversity among subjects in a population. Sometimes the distribution of these latent variables is of principal interest. In studies where sequences of observations are taken from subjects, ancillary variables, such as the number of observations provided by each subject, usually also vary between subjects. The goal here is to understand efficient estimation of the expectation of the latent variable in the presence of these subject-specific ancillaries.
Unbiased and efficient estimation of the expectation of the latent parameter depend on the dependence structure of three subject-specific components: the latent variable, the sequence of observations, and the ancillary. This dissertation considers estimation under two dependence configurations. In Chapter 3, efficiency is studied under the model in which no assumptions are made about the joint distribution of the latent variable and the subject-specific ancillary. Chapter 4 treats the setting where the ancillary variable and the latent variable are independent.

Subjects: Statistics, Latent variables, Estimation theory | lbm2126 | Statistics | Theses

Multi-scale approaches for high-speed imaging and analysis of large neural populations
https://academiccommons.columbia.edu/catalog/ac:2280gb5mmb
Friedrich, Johannes; Yang, Weijian; Soudry, Daniel; Mu, Yu; Ahrens, Misha B.; Yuste, Rafael; Peterka, Darcy S.; Paninski, Liam | 10.7916/D8P84QDK | Thu, 05 Oct 2017 19:56:12 +0000

Progress in modern neuroscience critically depends on our ability to observe the activity of large neuronal populations with cellular spatial and high temporal resolution. However, two bottlenecks constrain efforts towards fast imaging of large populations. First, the resulting large video data is challenging to analyze. Second, there is an explicit tradeoff between imaging speed, signal-to-noise, and field of view: with current recording technology we cannot image very large neuronal populations with simultaneously high spatial and temporal resolution. Here we describe multi-scale approaches for alleviating both of these bottlenecks. First, we show that spatial and temporal decimation techniques based on simple local averaging provide order-of-magnitude speedups in spatiotemporally demixing calcium video data into estimates of single-cell neural activity. Second, once the shapes of individual neurons have been identified at fine scale (e.g., after an initial phase of conventional imaging with standard temporal and spatial resolution), we find that the spatial/temporal resolution tradeoff shifts dramatically: after demixing we can accurately recover denoised fluorescence traces and deconvolved neural activity of each individual neuron from coarse-scale data that has been spatially decimated by an order of magnitude.
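The local-averaging decimation described above amounts to block averaging of the video tensor; a minimal sketch (array shapes and decimation factors are illustrative, not the paper's):

```python
import numpy as np

def decimate(movie, k_space=4, k_time=2):
    """Downsample a (T, H, W) calcium movie by simple local averaging:
    non-overlapping k_space x k_space spatial blocks and k_time-frame bins."""
    T, H, W = movie.shape
    T2, H2, W2 = T // k_time, H // k_space, W // k_space
    m = movie[:T2 * k_time, :H2 * k_space, :W2 * k_space]  # trim to whole blocks
    m = m.reshape(T2, k_time, H2, k_space, W2, k_space)
    return m.mean(axis=(1, 3, 5))                           # average within each block

movie = np.random.default_rng(0).random((100, 64, 64))
small = decimate(movie)   # (50, 16, 16): 32x fewer pixels, 2x fewer frames
```

Averaging (rather than subsampling) preserves photon counts in expectation, which is why demixing on the decimated movie loses so little accuracy.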
Demixing from decimated data thus offers a cheap method for compressing the large video data, and also implies that it is possible either to speed up imaging significantly, or to “zoom out” by a corresponding factor to image order-of-magnitude larger neuronal populations with minimal loss in accuracy or temporal resolution.

Subjects: Brain--Imaging, Neurosciences, Neurons, Computational biology | jf2954, wy2221, rmy5, dp2403, lmp2107 | Statistics, Biological Sciences, Zuckerman Mind Brain Behavior Institute | Articles

Empirical Bayes, Bayes factors and deoxyribonucleic acid fingerprinting
https://academiccommons.columbia.edu/catalog/ac:2jm63xsj4b
Basu, Ruma | 10.7916/D8J67VGB | Wed, 04 Oct 2017 22:15:57 +0000

The central theme of this thesis is Empirical Bayes. It starts off with an application of Bayes and Empirical Bayes methods to deoxyribonucleic acid fingerprinting. Different Bayes factors are obtained, and an alternative Bayes factor using the method of Savage is studied for both normal and non-normal priors. The thesis then moves on to deeper methodological aspects of Empirical Bayes theory. A 1983 conjecture by Carl Morris on parametric empirical Bayes prediction intervals for the normal regression model is studied and an improvement suggested. Carlin and Louis’ (1996) parametric empirical Bayes prediction interval for the same model is also treated analytically, whereas their approach had been primarily numerical. It is seen that both of these intervals have the same coverage probability and the same expected length up to a certain order of approximation, and both are equal-tailed up to the same order. Finally, a corrected proof of an important published result by Datta, Ghosh and Mukerjee (2000) is provided using first principles of probability matching. This result is relevant to our work on parametric empirical Bayes prediction intervals.

Subjects: Statistics, DNA fingerprinting, Bioinformatics | Statistics | Theses

Distributionally Robust Performance Analysis with Applications to Mine Valuation and Risk
https://academiccommons.columbia.edu/catalog/ac:g4f4qrfj84
Dolan, Christopher James | 10.7916/D8QJ7VSC | Fri, 29 Sep 2017 22:18:31 +0000

We consider several problems motivated by issues faced in the mining industry. In recent years, it has become clear that mines carry substantial tail risk in the form of environmental disasters, and this tail risk is not incorporated into common pricing and risk models. However, data sets of the extremal climate behavior that drives this risk are very small, and generally inadequate for properly estimating the tail behavior. We propose a data-driven methodology that produces reasonable worst-case scenarios given the data size constraints, and we incorporate this into a real-options-based model for the valuation of mines. We propose several different iterations of the model, to allow the end user to choose the degree to which they wish to specify the financial consequences of the disaster scenario. Next, in order to perform a risk analysis on a portfolio of mines, we propose a method of estimating the correlation structure of high-dimensional max-stable processes. Using the techniques of Liu et al. (2017) to map the relationship between normal correlations and max-stable correlations, we can then use techniques inspired by Bickel et al. (2008), Liu et al. (2014), and Rothman et al. (2009) to estimate the underlying correlation matrix while preserving a sparse, positive-definite structure. The correlation matrices are then used in the calculation of model-robust risk metrics (VaR, CVaR) using the Sample-Out-of-Sample methodology of Blanchet and Kang (2017). We conclude with several new techniques developed in the field of robust performance analysis that, while not directly applied to mining, were motivated by our studies of distributionally robust optimization to address these problems.

Subjects: Statistics, Mine valuation--Statistical methods, Robust statistics | cjd2119 | Statistics | Theses

Distributionally Robust Optimization and its Applications in Machine Learning
https://academiccommons.columbia.edu/catalog/ac:9cnp5hqc0q
Kang, Yang | 10.7916/D8WD4C1R | Fri, 25 Aug 2017 22:13:06 +0000

The goal of Distributionally Robust Optimization (DRO) is to minimize the cost of running a stochastic system under the assumption that an adversary can replace the underlying baseline stochastic model by another model within a family known as the distributional uncertainty region. This dissertation focuses on a class of DRO problems which are data-driven, meaning that the baseline stochastic model corresponds to the empirical distribution of a given sample.
One of the main contributions of this dissertation is to show that the class of data-driven DRO problems that we study unifies many successful machine learning algorithms, including square-root Lasso, support vector machines, and generalized logistic regression, among others. A key distinctive feature of the class of DRO problems that we consider here is that our distributional uncertainty region is based on optimal transport costs. In contrast, most of the DRO formulations that exist to date take advantage of a likelihood-based formulation (such as the Kullback-Leibler divergence, among others). Optimal transport costs include as a special case the so-called Wasserstein distance, which is popular in various statistical applications.
The use of optimal transport costs is advantageous relative to divergence-based formulations because the region of distributional uncertainty then contains distributions whose support extends beyond that of the empirical measure, which explains why many machine learning algorithms have the ability to improve generalization. Moreover, the DRO representations that we use to unify the previously mentioned machine learning algorithms provide a clear interpretation of the so-called regularization parameter, which is known to play a crucial role in controlling generalization error. As we establish, the regularization parameter corresponds exactly to the size of the distributional uncertainty region.
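For linear regression this correspondence has a concrete form: under a suitable transport cost, the worst-case root-MSE over a Wasserstein ball of radius δ reduces to a square-root-Lasso-type objective with δ as the regularization parameter. A numerical sketch of that regularized form (the exact reduction depends on the transport cost chosen; names here are illustrative):

```python
import numpy as np

def sqrt_lasso_objective(beta, X, y, radius):
    """Root-MSE plus radius * ||beta||_1: the regularized (dual) form of the
    worst-case root-MSE over a Wasserstein ball, with the DRO radius playing
    exactly the role of the regularization parameter."""
    rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
    return rmse + radius * np.linalg.norm(beta, 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, 0.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)

plain = sqrt_lasso_objective(beta_true, X, y, radius=0.0)   # empirical risk only
robust = sqrt_lasso_objective(beta_true, X, y, radius=0.1)  # bigger ball, bigger penalty
```

Shrinking the radius to zero recovers ordinary empirical risk minimization, so choosing the regularization parameter is literally choosing how much distributional uncertainty to insure against.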
Another contribution of this dissertation is the development of statistical methodology to study data-driven DRO formulations based on optimal transport costs. Using this theory, for example, we provide a sharp characterization of the optimal selection of regularization parameters in machine learning settings such as square-root Lasso and regularized logistic regression.
Our statistical methodology relies on the construction of a key object which we call the robust Wasserstein profile function (RWP function). The RWP function is similar in spirit to the empirical likelihood profile function in the context of empirical likelihood (EL), but the asymptotic analysis of the RWP function is different because of a certain lack of smoothness which arises in a suitable Lagrangian formulation.
Optimal transport costs have many advantages in terms of statistical modeling. For example, we show how to define a class of novel semi-supervised learning estimators which are natural companions of the standard supervised counterparts (such as square-root Lasso, support vector machines, and logistic regression). We also show how to define the distributional uncertainty region in a purely data-driven way. Precisely, the optimal transport formulation allows us to inform the shape of the distributional uncertainty, not only its center (which is given by the empirical distribution). This shape is informed by establishing connections to the metric learning literature. We develop a class of metric learning algorithms which are based on robust optimization, and we use them to inform the distributional uncertainty region in our data-driven DRO problem. This means that we endow the adversary with additional constraints which force it to spend effort on regions of importance, further improving the generalization properties of machine learning algorithms.
In summary, we explain how the use of optimal transport costs allows one to construct what we call double-robust statistical procedures. We test all of the procedures proposed in this dissertation on various data sets, showing significant improvement in generalization ability over a wide range of state-of-the-art procedures.
Finally, we also discuss a class of stochastic optimization algorithms of independent interest which are particularly useful for solving DRO problems, especially those which arise when the distributional uncertainty region is based on optimal transport costs.

Subjects: Statistics, Robust optimization, Machine learning, Mathematical optimization | yk2606 | Statistics | Theses

A unified view of high-dimensional bridge regression
https://academiccommons.columbia.edu/catalog/ac:3xsj3tx96j
Weng, Haolei | 10.7916/D82V2THP | Tue, 15 Aug 2017 22:36:12 +0000

In many application areas ranging from bioinformatics to imaging, we are interested in recovering a sparse coefficient vector in the high-dimensional linear model, when the sample size n is comparable to or less than the dimension p. One of the most popular classes of estimators is Lq-regularized least squares (LQLS), a.k.a. bridge regression. There have been extensive studies towards understanding the performance of best subset selection (q=0), the LASSO (q=1), and ridge regression (q=2), three widely known estimators from the LQLS family. This thesis aims at giving a unified view of LQLS for all non-negative values of q. In contrast to most existing works, which obtain order-wise error bounds with loose constants, we derive asymptotically exact error formulas characterized through a series of fixed point equations. A delicate analysis of the fixed point equations enables us to gain fruitful insights into the statistical properties of LQLS across the entire spectrum of Lq-regularization. Our work not only validates the scope of folklore understanding of Lq-minimization, but also provides new insights into high-dimensional statistics as a whole. We elaborate on our theoretical findings mainly from a parameter estimation point of view. At the end of the thesis, we briefly discuss bridge regression for variable selection and prediction.
We start by considering the parameter estimation problem and evaluate the performance of LQLS by characterizing the asymptotic mean square error (AMSE). The expression we derive for AMSE does not have an explicit form and hence is not directly useful for comparing LQLS across different values of q, or for evaluating the effect of the relative sample size n/p or the sparsity level of the coefficient. To simplify the expression, we first perform the phase transition (PT) analysis, a widely accepted analysis framework, of LQLS. Our results reveal some of the limitations and misleading features of the PT framework. To overcome these limitations, we propose a small-error analysis of LQLS. Our new analysis framework not only sheds light on the results of the phase transition analysis, but also describes when phase transition analysis is reliable, and presents a more accurate comparison among different Lq-regularizations.
We then extend our low-noise sensitivity analysis to linear models without sparsity structure. Our analysis, as a generalization of phase transition analysis, reveals a clear picture of bridge regression for estimating generic coefficients. Moreover, by a simple transformation we connect our low-noise sensitivity framework to the classical asymptotic regime in which n/p goes to infinity, and give some insightful implications beyond what classical asymptotic analysis of bridge regression can offer.
Furthermore, following the same idea of the new analysis framework, we obtain an explicit characterization of AMSE in the form of second-order expansions under the large-noise regime. The expansions yield some intriguing messages; for example, ridge regression outperforms the LASSO in estimating sparse coefficients when the measurement noise is large.
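The two best-known members of the LQLS family compared above are easy to sketch side by side: ridge (q=2) has a closed form, while the LASSO (q=1) can be solved by proximal gradient descent. A minimal illustration (the problem sizes, penalty values, and the ISTA solver choice are ours, not the thesis's):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form LQLS solution for q = 2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for LQLS with q = 1:
    minimize 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / largest squared singular value
    for _ in range(n_iter):
        g = beta - step * X.T @ (X @ beta - y)   # gradient step on the quadratic part
        beta = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(0)
n, p = 100, 20
beta_true = np.zeros(p)
beta_true[:3] = 2.0                              # sparse truth: 3 active coordinates
X = rng.normal(size=(n, p))
y = X @ beta_true + 0.1 * rng.normal(size=n)

b_ridge = ridge(X, y, lam=1.0)                   # dense estimate
b_lasso = lasso_ista(X, y, lam=5.0)              # sparse estimate
```

The qualitative contrast is immediate: the LASSO sets many coordinates exactly to zero while ridge shrinks all of them, which is the behavior the thesis quantifies exactly across the whole range of q.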
Finally, we present a short analysis of LQLS for the purposes of variable selection and prediction. We propose a two-stage variable selection technique based on the LQLS estimators and describe its superiority and close connection to parameter estimation. For prediction, we illustrate the intricate relation between tuning parameter selection for optimal in-sample prediction and that for optimal parameter estimation.

Subjects: Statistics, Regression analysis, Mathematics | hw2375 | Statistics | Theses

Contributions to Semiparametric Inference to Biased-Sampled and Financial Data
https://academiccommons.columbia.edu/catalog/ac:177018
Sit, Tony | 10.7916/D81R72W2 | Wed, 09 Aug 2017 15:54:08 +0000

This thesis develops statistical models and methods for the analysis of lifetime and financial data within a semiparametric framework. The first part studies the use of empirical likelihood for the Lévy processes that are used to model the dynamics exhibited in financial data. The second part studies inferential procedures for survival data collected under various biased sampling schemes, in both transformation and accelerated failure time models. During the last decade, Lévy processes with jumps have become increasingly popular for modelling market behaviour for both derivative pricing and risk management purposes. Chan et al. (2009) introduced the use of empirical likelihood methods to estimate the parameters of various diffusion processes via their characteristic functions, which are readily available in most cases. Return series from the market are used for estimation. In addition to the return series, there are many derivatives actively traded in the market whose prices also contain information about the parameters of the underlying process. This observation motivates us to combine the return series and the associated derivative prices observed in the market so as to provide an estimation more reflective of market movement and to achieve a gain in efficiency. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. We perform simulation and case studies to demonstrate the feasibility and effectiveness of the proposed method. The second part of this thesis investigates a unified estimation method for semiparametric linear transformation models and the accelerated failure time model under general biased sampling schemes. The methodology proposed was first investigated in Paik (2009), in which the length-biased case is considered for transformation models.
The new estimator is obtained from a set of counting-process-based unbiased estimating equations, developed by introducing a general weighting scheme that offsets the sampling bias. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. A closed-form formula is derived for the limiting variance, and the plug-in estimator is shown to be consistent. We demonstrate the unified approach through the special cases of left truncation, length bias, the case-cohort design, and variants thereof. Simulation studies and applications to real data sets are also presented.

Subjects: Statistics | ts2500 | Statistics | Theses

Detecting Dependence Change Points in Multivariate Time Series with Applications in Neuroscience and Finance
https://academiccommons.columbia.edu/catalog/ac:177012
Cribben, Ivor John | 10.7916/D8JQ1CF0 | Wed, 09 Aug 2017 15:53:53 +0000

In many applications there are dynamic changes in the dependency structure between multivariate time series; two examples are neuroscience and finance. The second and third chapters focus on neuroscience and introduce a data-driven technique for partitioning a time course into distinct temporal intervals with different multivariate functional connectivity patterns between a set of brain regions of interest (ROIs). The technique, called Dynamic Connectivity Regression (DCR), detects temporal change points in functional connectivity and estimates a graph, or set of relationships between ROIs, for the data in each temporal partition that falls between pairs of change points. Hence, DCR allows for estimation of both the times of change in connectivity and the connectivity graph for each partition, without requiring prior knowledge of the nature of the experimental design. Permutation and bootstrapping methods are used to perform inference on the change points. In the second chapter of this work, we focus on multi-subject data, while in the third chapter we concentrate on single-subject data and extend the DCR methodology in two ways: (i) we alter the algorithm to make it more accurate for individual-subject data with a small number of observations, and (ii) we perform inference on the edges, or connections between brain regions, in order to reduce the number of false positives in the graphs. We also discuss a likelihood ratio test to compare precision matrices (inverse covariance matrices) across subjects, as well as a test across subjects on single edges, or partial correlations, in the graph. In the final chapter of this work, we turn to a finance setting. We use the same DCR technique to detect changes in dependency structure in multivariate financial time series for situations where both the placement and the number of change points are unknown.
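Once change points have been located, DCR fits a sparse undirected graph to each resulting segment. A minimal sketch of that per-partition step using scikit-learn's graphical lasso (the segment boundaries, penalty value, and synthetic data below are illustrative):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def segment_graphs(X, change_points, alpha=0.1):
    """Fit a sparse undirected graph (via the graphical lasso) to each
    segment between adjacent change points, mirroring DCR's per-partition
    estimation step."""
    bounds = [0] + sorted(change_points) + [len(X)]
    graphs = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        gl = GraphicalLasso(alpha=alpha).fit(X[lo:hi])
        graphs.append(gl.precision_ != 0)   # adjacency: nonzero partial correlations
    return graphs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(150, 4)),        # regime 1
               2.0 * rng.normal(size=(150, 4))]) # regime 2: higher variance
graphs = segment_graphs(X, change_points=[150])  # one graph per segment
```

The chapter's refinement is precisely about the edges these fits return: performing inference on them to prune false positives, and replacing the Gaussian glasso with rank-based or tlasso estimates when the data are heavy-tailed.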
In this setting, DCR finds the dependence change points and estimates an undirected graph representing the relationship between time series within each interval created by pairs of adjacent change points. A shortcoming of the proposed DCR methodology is the presence of an excessive number of false positive edges in the undirected graphs, especially when the data deviates from normality. Here we address this shortcoming by proposing a procedure for performing inference on the edges, or partial dependencies between time series, that effectively removes false positive edges. We also discuss two robust estimation procedures based on ranks and the tlasso (Finegold and Drton, 2011) technique, which we contrast with the glasso technique used by DCR.Statisticsijc2104StatisticsThesesTensor Analysis Reveals Distinct Population Structure that Parallels the Different Computational Roles of Areas M1 and V1
https://academiccommons.columbia.edu/catalog/ac:205830
Seely, Jeffrey Scott; Kaufman, Matthew T.; Ryu, Stephen I.; Shenoy, Krishna V.; Cunningham, John Patrick; Churchland, Mark M.10.7916/D8N29XF1Wed, 05 Jul 2017 13:43:49 +0000Cortical firing rates frequently display elaborate and heterogeneous temporal structure. One often wishes to compute quantitative summaries of such structure—a basic example is the frequency spectrum—and compare with model-based predictions. The advent of large-scale population recordings affords the opportunity to do so in new ways, with the hope of distinguishing between potential explanations for why responses vary with time. We introduce a method that assesses a basic but previously unexplored form of population-level structure: when data contain responses across multiple neurons, conditions, and times, they are naturally expressed as a third-order tensor. We examined tensor structure for multiple datasets from primary visual cortex (V1) and primary motor cortex (M1). All V1 datasets were ‘simplest’ (there were relatively few degrees of freedom) along the neuron mode, while all M1 datasets were simplest along the condition mode. These differences could not be inferred from surface-level response features. Formal considerations suggest why tensor structure might differ across modes. For idealized linear models, structure is simplest across the neuron mode when responses reflect external variables, and simplest across the condition mode when responses reflect population dynamics. This same pattern was present for existing models that seek to explain motor cortex responses. Critically, only dynamical models displayed tensor structure that agreed with the empirical M1 data. These results illustrate that tensor structure is a basic feature of the data. 
For M1 the tensor structure was compatible with only a subset of existing models.Motor cortex, Calculus of tensors, Neurons, Visual cortex, Neurosciences, Biometryjss2219, jpc2181, mc3502Neurobiology and Behavior, Statistics, NeuroscienceArticlesMachine learning and data mining in complex genomic data a review on the lessons learned in Genetic Analysis Workshop Nineteen
https://academiccommons.columbia.edu/catalog/ac:206639
Konig, Inke R.; Auerbach, Jonathan Lyle; Gola, Damian; Held, Elizabeth; Holzinger, Emily R.; Legault, Marc Andre; Sun, Rui; Tintle, Nathan; Yang, Hsin Chou10.7916/D8HT2TZ6Wed, 05 Jul 2017 13:43:01 +0000In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated from two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data.
In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.Machine learning, Data mining, Genomicsjla2167StatisticsArticlesEstimation of Total Body Skeletal Muscle Mass in Chinese Adults: Prediction Model by Dual-Energy X-Ray Absorptiometry
https://academiccommons.columbia.edu/catalog/ac:207364
Zhao, Xinyu; Wang, ZiMian; Zhang, Junyi; Hua, Jianming; He, Wei; Zhu, Shankuan10.7916/D8MS3ZGJWed, 05 Jul 2017 13:40:41 +0000Background: There are few reports on total body skeletal muscle mass (SM) in Chinese. The objective of this study is to establish a prediction model of SM for Chinese adults.
Methodology: Appendicular lean soft tissue (ALST) was measured by dual energy X-ray absorptiometry (DXA) and SM by magnetic resonance image (MRI) in 66 Chinese adults (52 men and 14 women). Images of MRI were segmented into compartments including intermuscular adipose tissue (IMAT) and IMAT-free SM. Regression was used to fit the prediction model SM = c + k × ALST. Age and gender were adjusted in the fitted model. The piece-wise linear function was performed to further explore the effect of age on SM. ‘Leave-One-Out Cross Validation’ was utilized to evaluate the prediction performance. The significance of observed differences between predicted and actual SM was tested by t test and the level of agreement was assessed by the method of Bland and Altman.
Results: Men had greater ALST and IMAT-free SM than women. ALST was the primary predictor and highly correlated with IMAT-free SM (R2 = 0.94, SEE = 1.11 kg, P<0.001). Age was an additional predictor (SM prediction model with age adjusted R2 = 0.95, SEE = 1.05 kg, P<0.001). There was a piece-wise linear relationship between age and IMAT-free SM: IMAT-free SM = 1.21×ALST−0.98, (Age <45 years) and IMAT-free SM = 1.21×ALST−0.98−0.04×(Age−45), (Age ≥45 years). The prediction performance of this age-adjusted model was good as assessed by ‘Leave-One-Out Cross Validation’. No significant difference between measured and predicted IMAT-free SM was detected.
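The fitted piece-wise model above is simple enough to code directly. A minimal sketch, with the coefficients taken verbatim from the abstract (the function and variable names are ours):

```python
def predict_imat_free_sm(alst_kg: float, age_years: float) -> float:
    """Predict IMAT-free skeletal muscle mass (kg) from the fitted piece-wise model:
    SM = 1.21*ALST - 0.98, with an extra -0.04 kg per year of age beyond 45."""
    sm = 1.21 * alst_kg - 0.98
    if age_years >= 45:
        sm -= 0.04 * (age_years - 45)
    return sm

# For ALST = 20 kg: a 40-year-old and a 55-year-old differ by 0.04 * 10 = 0.4 kg.
young = predict_imat_free_sm(20.0, 40)  # 1.21*20 - 0.98 = 23.22
older = predict_imat_free_sm(20.0, 55)  # 23.22 - 0.4 = 22.82
```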
Conclusion: A previous SM prediction model developed in multi-ethnic groups underestimated SM by 2.3% and 3.4% for Chinese men and women, respectively. A new prediction model by DXA has been established to predict SM in Chinese adults.Muscles, Chinese--Health and hygiene, Human anatomy, Musculoskeletal system, Biologyjz2299College of Physicians and Surgeons, StatisticsArticlesTime Series Modeling with Shape Constraints
https://academiccommons.columbia.edu/catalog/ac:qz612jm65v
Zhang, Jing10.7916/D84X5M55Fri, 30 Jun 2017 22:15:28 +0000This thesis focuses on the development of semiparametric estimation methods for a class of time series models using shape constraints. Many of the existing time series models assume the noise follows some known parametric distributions. Typical examples are the Gaussian and t distributions. Then the model parameters are estimated by maximizing the resultant likelihood function.
As an example, the autoregressive moving average (ARMA) models (Brockwell and Davis, 2009) assume a Gaussian noise sequence and are estimated under the causal-invertible constraint by maximizing the Gaussian likelihood. Although the same estimates can also be used in the causal-invertible non-Gaussian case, they are not asymptotically optimal (Rosenblatt, 2012). Moreover, for the noncausal/noninvertible cases, the Gaussian likelihood estimation procedure is not applicable, since second-order-based methods cannot distinguish between causal-invertible and noncausal/noninvertible models (Brockwell and Davis, 2009). As a result, many estimation methods for noncausal/noninvertible ARMA models assume the noise follows a known non-Gaussian distribution, like a Laplace distribution or a t distribution. To relax this distributional assumption and allow noncausal/noninvertible models, we borrow ideas from nonparametric shape-constraint density estimation and propose a semiparametric estimation procedure for general ARMA models by projecting the underlying noise distribution onto the space of log-concave measures (Cule and Samworth, 2010; Dümbgen et al., 2011). We show that the maximum likelihood estimators in this semiparametric setting are consistent. In fact, the MLE is robust to the misspecification of log-concavity in cases where the true distribution of the noise is close to its log-concave projection. We derive a lower bound for the best asymptotic variance of regular estimators at rate sqrt(n) for AR models and construct a semiparametric efficient estimator.
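The claim that second-order methods cannot separate invertible from noninvertible models can be checked by hand in the simplest case: an MA(1) with parameter θ and noise variance σ², and an MA(1) with parameter 1/θ and noise variance σ²θ², have identical autocovariance functions. A minimal numeric check of this standard fact (names are ours):

```python
def ma1_autocov(theta: float, sigma2: float) -> tuple:
    """Autocovariances (gamma_0, gamma_1) of the MA(1) model X_t = Z_t + theta*Z_{t-1}."""
    return sigma2 * (1 + theta ** 2), sigma2 * theta

theta, sigma2 = 0.5, 1.0
invertible = ma1_autocov(theta, sigma2)                      # |theta| < 1
noninvertible = ma1_autocov(1 / theta, sigma2 * theta ** 2)  # |1/theta| > 1
# Both give (gamma_0, gamma_1) = (1.25, 0.5): indistinguishable from second moments.
```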
We also consider modeling time series of counts with shape constraints. Many of the formulated models for count time series are expressed via a pair of generalized state-space equations. In this set-up, the observation equation specifies the conditional distribution of the observation Yt at time t given a state-variable Xt. For count time series, this conditional distribution is usually specified as coming from a known parametric family such as the Poisson or the Negative Binomial distribution. To relax this formal parametric framework, we introduce a concave shape constraint into the one-parameter exponential family. This essentially amounts to assuming that the reference measure is log-concave. In this fashion, we are able to extend the class of observation-driven models studied in Davis and Liu (2016). Under this formulation, there exists a stationary and ergodic solution to the state-space model. In this new modeling framework, we consider the inference problem of estimating both the parameters of the mean model and the log-concave function, corresponding to the reference measure. We then compute and maximize the likelihood function over both the parameters associated with the mean function and the reference measure subject to a concavity constraint. The estimator of the mean function and the conditional distribution are shown to be consistent and perform well compared to a full parametric model specification. The finite sample behavior of the estimators are studied via simulation and two empirical examples are provided to illustrate the methodology.Statistics, Time-series analysis--Mathematical modelsjz2300StatisticsThesesA Robust Model-free Approach for Rare Variants Association Studies Incorporating Gene-Gene and Gene-Environmental Interactions
https://academiccommons.columbia.edu/catalog/ac:203800
Fan, Ruixue; Lo, Shaw-Hwa10.7916/D84J0FF0Fri, 30 Jun 2017 18:36:57 +0000Recently, more and more evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and the aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions in rare variants association studies. SPA has three advantages. First, it accounts for interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Second, it does not sacrifice marginal detection power; in situations where rare variants have only marginal effects, it is comparable with the most competitive methods in the current literature. Third, it is easy to extend and can incorporate more complex interactions; practitioners and scientists can readily tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions.Genotype-environment interaction, Allelomorphism, Geneticsrf2283, shl5StatisticsArticlesPulmonary Hyperinflation and Left Ventricular Mass
https://academiccommons.columbia.edu/catalog/ac:199889
Smith, Benjamin; Kawut, Steven M.; Bluemke, David A.; Basner, Robert C.; Gomes, Antoinette S.; Hoffman, Eric; Kalhan, Ravi; Lima, Joao A. C.; Liu, Chia-Ying; Michos, Erin D.; Prince, Martin R.; Rabbani, Leroy E.; Rabinowitz, Daniel; Shimbo, Daichi; Shea, Steven J. C.; Barr, R. Graham10.7916/D8BR8S99Fri, 30 Jun 2017 16:54:36 +0000Background—Left ventricular (LV) mass is an important predictor of heart failure and cardiovascular mortality, yet determinants of LV mass are incompletely understood. Pulmonary hyperinflation in chronic obstructive pulmonary disease (COPD) may contribute to changes in intrathoracic pressure that increase LV wall stress. We therefore hypothesized that residual lung volume in COPD would be associated with greater LV mass.
Methods and Results—The Multi-Ethnic Study of Atherosclerosis (MESA) COPD Study recruited smokers 50 to 79 years of age who were free of clinical cardiovascular disease. LV mass was measured by cardiac magnetic resonance. Pulmonary function testing was performed according to guidelines. Regression models were used to adjust for age, sex, body size, blood pressure, and other cardiac risk factors. Among 119 MESA COPD Study participants, the mean age was 69±6 years, 55% were male, and 65% had COPD, mostly of mild or moderate severity. Mean LV mass was 128±34 g. Residual lung volume was independently associated with greater LV mass (7.2 g per 1-SD increase in residual volume; 95% confidence interval, 2.2–12; P=0.004) and was similar in magnitude to that of systolic blood pressure (7.6 g per 1-SD increase in systolic blood pressure; 95% confidence interval, 4.3–11; P<0.001). Similar results were observed for the ratio of LV mass to end-diastolic volume (P=0.02) and with hyperinflation measured as residual volume to total lung capacity ratio (P=0.009).
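Effect sizes reported "per 1-SD increase," as above, correspond to regressing the outcome on a z-scored predictor. A sketch with simulated data (the numbers and names are illustrative, not the MESA data, and covariate adjustment is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
rv = rng.normal(100, 20, size=n)                 # residual lung volume (arbitrary units)
z = (rv - rv.mean()) / rv.std()                  # z-score, so the slope is 'per 1-SD'
lv_mass = 128 + 7.2 * z + rng.normal(0, 5, n)    # outcome with a true 7.2 g per-SD effect

X = np.column_stack([np.ones(n), z])             # intercept + standardized predictor
beta, *_ = np.linalg.lstsq(X, lv_mass, rcond=None)
per_sd_effect = beta[1]                          # recovers roughly 7.2 g per 1-SD
```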
Conclusions—Pulmonary hyperinflation, as measured by residual lung volume or residual lung volume to total lung capacity ratio, is associated with greater LV mass.Heart failure, Lungs--Diseases, Obstructive, Heart--Left ventricle, Medical sciences, Epidemiology, Medicinebs2723, rcb42, mrp2102, ler8, dr105, ds2231, ss35, rgb9Medicine, Center for Behavioral Cardiovascular Health, Statistics, RadiologyArticlesA Generalizable Brain-Computer Interface (BCI) Using Machine Learning for Feature Discovery
https://academiccommons.columbia.edu/catalog/ac:192916
Nurse, Ewan S.; Karoly, Philippa J.; Grayden, David B.; Freestone, Dean R.10.7916/D8KS6R9NFri, 30 Jun 2017 00:45:30 +0000This work describes a generalized method for classifying motor-related neural signals for a brain-computer interface (BCI), based on a stochastic machine learning method. The method differs from the various feature extraction and selection techniques employed in many other BCI systems. The classifier does not use extensive a-priori information, resulting in reduced reliance on highly specific domain knowledge. Instead of pre-defining features, the time-domain signal is input to a population of multi-layer perceptrons (MLPs) in order to perform a stochastic search for the best structure. The results showed that the average performance of the new algorithm outperformed other published methods using the Berlin BCI IV (2008) competition dataset and was comparable to the best results in the Berlin BCI II (2002–3) competition dataset. The new method was also applied to electroencephalography (EEG) data recorded from five subjects undertaking a hand squeeze task and demonstrated high levels of accuracy with a mean classification accuracy of 78.9% after five-fold cross-validation. Our new approach has been shown to give accurate results across different motor tasks and signal types as well as between subjects.Brain-computer interfaces, Electroencephalography--Computer programs, Machine learning, Neurons, Neural networks (Computer science), NeurosciencesStatisticsArticlesHuman and Machine Learning in Non-Markovian Decision Making
https://academiccommons.columbia.edu/catalog/ac:192871
Clarke, Aaron Michael; Friedrich, Johannes; Tartaglia, Elisa M.; Herzog, Michael H.; Marchesotti, Silvia; Senn, Walter10.7916/D8G44Q1DFri, 30 Jun 2017 00:43:41 +0000Humans can learn under a wide variety of feedback conditions. Reinforcement learning (RL), where a series of rewarded decisions must be made, is a particularly important type of learning. Computational and behavioral studies of RL have focused mainly on Markovian decision processes, where the next state depends on only the current state and action. Little is known about non-Markovian decision making, where the next state depends on more than the current state and action. Learning is non-Markovian, for example, when there is no unique mapping between actions and feedback. We have produced a model based on spiking neurons that can handle these non-Markovian conditions by performing policy gradient descent. Here, we examine the model’s performance and compare it with human learning and a Bayes optimal reference, which provides an upper-bound on performance. We find that in all cases, our population of spiking neurons model well-describes human performance.Learning strategies, Reinforcement learning, Neurons, Decision making, Decision making--Mathematical models, Markov processes--Mathematical models, Markov processes, Education, Psychologyjf2954StatisticsArticlesDistributed Bayesian Computation and Self-Organized Learning in Sheets of Spiking Neurons with Local Lateral Inhibition
https://academiccommons.columbia.edu/catalog/ac:192253
Buesing, Lars; Habenschuss, Stefan; Bill, Johannes; Nessler, Bernhard; Maass, Wolfgang; Legenstein, Robert10.7916/D8862G4XThu, 29 Jun 2017 23:25:22 +0000During the last decade, Bayesian probability theory has emerged as a framework in cognitive science and neuroscience for describing perception, reasoning and learning of mammals. However, our understanding of how probabilistic computations could be organized in the brain, and how the observed connectivity structure of cortical microcircuits supports these calculations, is rudimentary at best. In this study, we investigate statistical inference and self-organized learning in a spatially extended spiking network model, that accommodates both local competitive and large-scale associative aspects of neural information processing, under a unified Bayesian account. Specifically, we show how the spiking dynamics of a recurrent network with lateral excitation and local inhibition in response to distributed spiking input, can be understood as sampling from a variational posterior distribution of a well-defined implicit probabilistic model. This interpretation further permits a rigorous analytical treatment of experience-dependent plasticity on the network level. Using machine learning theory, we derive update rules for neuron and synapse parameters which equate with Hebbian synaptic and homeostatic intrinsic plasticity rules in a neural implementation. In computer simulations, we demonstrate that the interplay of these plasticity rules leads to the emergence of probabilistic local experts that form distributed assemblies of similarly tuned cells communicating through lateral excitatory connections. The resulting sparse distributed spike code of a well-adapted network carries compressed information on salient input features combined with prior experience on correlations among them. 
Our theory predicts that the emergence of such efficient representations benefits from network architectures in which the range of local inhibition matches the spatial extent of pyramidal cells that share common afferent input.Neuroplasticity, Neurons, Inhibition, Bayesian statistical decision theory, Neurosciences, Molecular biology, StatisticsStatisticsArticlesGLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks
https://academiccommons.columbia.edu/catalog/ac:185410
He, Ran; Zheng, Tian10.7916/D8S46QVQThu, 29 Jun 2017 03:44:14 +0000Large networks, as a form of big data, have received an increasing amount of attention in data science, especially large social networks, which are reaching sizes of hundreds of millions of nodes with daily interactions on the scale of billions. Analyzing and modeling these data to understand the connectivity and dynamics of large networks is therefore important in a wide range of scientific fields. Among popular models, exponential random graph models (ERGMs) have been developed to study these complex networks by directly modeling network structures and features. ERGMs, however, are hard to scale to large networks because maximum likelihood estimation of their parameters can be very difficult, due to the unknown normalizing constant. Alternative strategies based on Markov chain Monte Carlo (MCMC) draw samples to approximate the likelihood, which is then maximized to obtain the maximum likelihood estimators (MLE). These strategies suffer from poor convergence due to model degeneracy issues and cannot be used on large networks. Chatterjee et al. (Ann Stat 41:2428–2461, 2013) propose a new theoretical framework for estimating the parameters of ERGMs by approximating the normalizing constant using an emerging tool in graph theory: graph limits. In this paper, we construct a complete computational procedure built upon their results, with practical innovations, that is fast and able to scale to large networks. More specifically, we evaluate the likelihood via a simple function approximation of the corresponding ERGM’s graph limit and iteratively maximize the likelihood to obtain the MLE. We also discuss methods for conducting likelihood ratio tests for ERGMs, as well as related issues. Through simulation studies and real data analysis of two large social networks, we show that our new method outperforms the MCMC-based method, especially when the network size is large (more than 100 nodes).
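The graph-limit idea can be illustrated in its simplest form: for an edge-triangle ERGM, the Chatterjee et al. framework replaces the intractable normalizing constant with a variational problem over graphons, which, when restricted to constant graphons u (a simplification that is exact only in part of the parameter space), becomes the one-dimensional optimization ψ(β) = sup_u [β₁u + β₂u³ − ½(u log u + (1−u) log(1−u))]. A grid-search sketch of that restricted problem, not the paper's full GLMLE procedure (names are ours):

```python
import numpy as np

def psi_constant_graphon(beta1: float, beta2: float, grid: int = 100001):
    """Approximate psi(beta) = sup_u [beta1*u + beta2*u^3 - 0.5*I(u)] on a grid,
    where I(u) = u*log(u) + (1-u)*log(1-u). Returns (psi, maximizing u)."""
    u = np.linspace(1e-9, 1 - 1e-9, grid)
    entropy = u * np.log(u) + (1 - u) * np.log(1 - u)
    objective = beta1 * u + beta2 * u ** 3 - 0.5 * entropy
    i = int(np.argmax(objective))
    return objective[i], u[i]

# Sanity check: with beta2 = 0 the model reduces to Erdos-Renyi and the maximizer
# is u* = e^{2*beta1} / (1 + e^{2*beta1}); at beta1 = 0 this gives u* = 1/2 and
# psi = 0.5 * log(2).
psi, u_star = psi_constant_graphon(0.0, 0.0)
```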
One limitation of our approach, inherited from the limitation of the result of Chatterjee et al. (Ann Stat 41:2428–2461, 2013), is that it works only for sequences of graphs with a positive limiting density, i.e., dense graphs.Statisticsrh2528, tz33StatisticsArticlesA partition-based approach to identify gene-environment interactions in genome wide association studies
https://academiccommons.columbia.edu/catalog/ac:184908
Fan, Ruixue; Huang, Chien-Hsun; Hu, Inchi; Wang, Haitan; Zheng, Tian; Lo, Shaw-Hwa10.7916/D8542MGFThu, 29 Jun 2017 03:42:19 +0000It is believed that almost all common diseases are the consequence of complex interactions between genetic markers and environmental factors. However, few such interactions have been documented to date. Conventional statistical methods for detecting gene and environmental interactions are often based on the linear regression model, which assumes a linear interaction effect. In this study, we propose a nonparametric partition-based approach that is able to capture complex interaction patterns. We apply this method to the real data set of hypertension provided by Genetic Analysis Workshop 18. Compared with the linear regression model, the proposed approach is able to identify many additional variants with significant gene-environmental interaction effects. We further investigate one single-nucleotide polymorphism identified by our method and show that its gene-environmental interaction effect is, indeed, nonlinear. To adjust for the family dependence of phenotypes, we apply different permutation strategies and investigate their effects on the outcomes.Genetics, Biometryrf2283, ch2526, tz33, shl5StatisticsArticlesConsidering interactive effects in the identification of influential regions with extremely rare variants via fixed bin approach
https://academiccommons.columbia.edu/catalog/ac:184914
Agne, Michael; Huang, Chien-Hsun; Hu, Inchi; Wang, Haitian; Zheng, Tian; Lo, Shaw-Hwa10.7916/D8445KCHThu, 29 Jun 2017 03:42:17 +0000In this study, we analyze the Genetic Analysis Workshop 18 (GAW18) data to identify regions of single-nucleotide polymorphisms (SNPs), which significantly influence hypertension status among individuals. We have studied the marginal impact of these regions on disease status in the past, but we extend the method to deal with environmental factors present in data collected over several exam periods. We consider the respective interactions between such traits as smoking status and age with the genetic information and hope to augment those genetic regions deemed influential marginally with those that contribute via an interactive effect. In particular, we focus only on rare variants and apply a procedure to combine signal among rare variants in a number of "fixed bins" along the chromosome. We extend the procedure in Agne et al to incorporate environmental factors by dichotomizing subjects via traits such as smoking status and age, running the marginal procedure among each respective category (i.e., smokers or nonsmokers), and then combining their scores into a score for interaction. To avoid overlap of subjects, we examine each exam period individually. Out of a possible 629 fixed-bin regions in chromosome 3, we observe that 11 show up in multiple exam periods for gene-smoking score. Fifteen regions exhibit significance for multiple exam periods for gene-age score, with 4 regions deemed significant for all 3 exam periods. The procedure pinpoints SNPs in 8 "answer" genes, with 5 of these showing up as significant in multiple testing schemes (Gene-Smoking, Gene-Age for Exams 1, 2, and 3).Genetics, Biometrymra2110, ch2526, tz33, shl5StatisticsArticlesA dual-clustering framework for association screening with whole genome sequencing data and longitudinal traits
https://academiccommons.columbia.edu/catalog/ac:184911
Lui, Ying; Huang, Chien-Hsun; Hu, Inchi; Zheng, Tian; Lo, Shaw-Hwa10.7916/D8N29VVKThu, 29 Jun 2017 03:41:56 +0000Current sequencing technology enables generation of whole genome sequencing data sets that contain a high density of rare variants, each of which is carried by, at most, 5% of the sampled subjects. Such variants are involved in the etiology of most common diseases in humans. These diseases can be studied by relevant longitudinal phenotype traits. Tests for association between such genotype information and longitudinal traits allow the study of the function of rare variants in complex human disorders. In this paper, we propose an association-screening framework that highlights the genotypic differences observed on rare variants and the longitudinal nature of phenotypes. In particular, both variants within a gene and longitudinal phenotypes are used to create partitions of subjects. Association between the 2 sets of constructed partitions is then evaluated. We apply the proposed strategy to the simulated data from the Genetic Analysis Workshop 18 and compare the obtained results with those from sequence kernel association test using the receiver operating characteristic curves.Genetics, Biometrych2526, tz33, shl5StatisticsArticlesDiscovering pure gene-environment interactions in blood pressure genome-wide association studies data: a two-step approach incorporating new statistics
https://academiccommons.columbia.edu/catalog/ac:184905
Wang, Maggie Haitan; Huang, Chien-Hsun; Zheng, Tian; Lo, Shaw-Hwa; Hu, Inchi10.7916/D8DN43X5Thu, 29 Jun 2017 03:41:56 +0000Environment has long been known to play an important part in disease etiology. However, not many genome-wide association studies take environmental factors into consideration. There is also a need for new methods to identify the gene-environment interactions. In this study, we propose a 2-step approach incorporating an influence measure that captures the pure gene-environment effect. We found that pure gene-age interaction has a stronger association than considering the genetic effect alone for systolic blood pressure, measured by counting the number of single-nucleotide polymorphisms (SNPs) reaching a certain significance level. We analyzed the subjects by dividing them into two age groups and found no overlap in the top identified SNPs between them. This suggested that age might have a nonlinear effect on genetic association. Furthermore, the scores of the top SNPs for the two age subgroups were about 3 times those obtained when using all subjects for systolic blood pressure. In addition, the scores of the older age subgroup were much higher than those for the younger group. The results suggest that genetic effects are stronger in older age and that genetic association studies should take environmental effects into consideration, especially age.Genetics, Biometrych2526, tz33, shl5StatisticsArticlesBayesian hierarchical graph-structured model for pathway analysis using gene expression data
https://academiccommons.columbia.edu/catalog/ac:184980
Zhou, Hui; Zheng, Tian10.7916/D8DB80QNThu, 29 Jun 2017 03:41:36 +0000In genomic analysis, there is growing interest in network structures that represent biochemistry interactions. Graph structured or constrained inference takes advantage of a known relational structure among variables to introduce smoothness and reduce complexity in modeling, especially for high-dimensional genomic data. There has been a lot of interest in its application in model regularization and selection. However, prior knowledge on the graphical structure among the variables can be limited and partial. Empirical data may suggest variations and modifications to such a graph, which could lead to new and interesting biological findings. In this paper, we propose a Bayesian random graph-constrained model, rGrace, an extension from the Grace model, to combine a priori network information with empirical evidence, for applications such as pathway analysis. Using both simulations and real data examples, we show that the new method, while leading to improved predictive performance, can identify discrepancy between data and a prior known graph structure and suggest modifications and updates.Biometry, Geneticshz2106, tz33StatisticsArticlesSurveying Hard-to-Reach Groups Through Sampled Respondents in a Social Network
https://academiccommons.columbia.edu/catalog/ac:185373
McCormick, Tyler H.; Zheng, Tian; He, Ran; Kolaczyk, Eric10.7916/D8Z0372NThu, 29 Jun 2017 03:41:08 +0000The sampling frame in most social science surveys misses members of certain groups, such as the homeless or individuals living with HIV. These groups are known as hard-to-reach groups. One strategy for learning about these groups, or subpopulations, involves reaching hard-to-reach group members through their social network. In this paper we compare the efficiency of two common methods for subpopulation size estimation using data from standard surveys. These designs are examples of mental link tracing designs. These designs begin with a randomly sampled set of network members (nodes) and then reach other nodes indirectly through questions asked to the sampled nodes. Mental link tracing designs cost significantly less than traditional link tracing designs, yet introduce additional sources of potential bias. We examine the influence of one such source of bias using simulation studies. We then demonstrate our findings using data from the General Social Survey collected in 2004 and 2006. Additionally, we provide survey design suggestions for future surveys incorporating such designs.Statistics, Social sciences--Researchthm2105, tz33, rh2528StatisticsArticlesLatent demographic profile estimation in hard-to-reach groups
https://academiccommons.columbia.edu/catalog/ac:184956
McCormick, Tyler H.; Zheng, Tian10.7916/D8F76BFQThu, 29 Jun 2017 03:41:07 +0000The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using Aggregated relational data (ARD), or questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28–39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.Statisticsthm2105, tz33StatisticsArticlesA Practical Guide to Measuring Social Structure Using Indirectly Observed Network Data
https://academiccommons.columbia.edu/catalog/ac:185370
McCormick, Tyler H.; Moussa, Amal; DiPrete, Thomas A.; Ruf, Johannes; Gelman, Andrew E.; Teitler, Julien O.; Zheng, Tian10.7916/D86H4G9DThu, 29 Jun 2017 03:41:05 +0000Aggregated relational data (ARD) are an increasingly common tool for learning about social networks through standard surveys. Recent statistical advances present social scientists with new options for analyzing such data. In this article, we propose guidelines for learning about various network processes using ARD and a template to aid practitioners. We first propose that ARD can be used to measure “social distance” between a respondent and a subpopulation (individuals named Kevin, those in prison, or those serving in the military). We then present common methods for analyzing these data and associate each of these methods with a specific way of measuring social distance, thus associating statistical tools with their underlying social science phenomena. We examine the implications of using each of these social distance measures using an Internet survey about contemporary political issues.Statistics, Social sciences--Researchthm2105, am2810, tad61, ag389, jot8, tz33Sociology, Statistics, Social WorkArticlesIdentifying rare disease variants in the Genetic Analysis Workshop 17 simulated data: a comparison of several statistical approaches
https://academiccommons.columbia.edu/catalog/ac:184928
Fan, Ruixue; Huang, Chien-Hsun; Lo, Shaw-Hwa; Zheng, Tian; Ionita-Laza, Iuliana10.7916/D89P30J1Thu, 29 Jun 2017 03:40:31 +0000Genome-wide association studies have been successful at identifying common disease variants associated with complex diseases, but the common variants identified have small effect sizes and account for only a small fraction of the estimated heritability for common diseases. Theoretical and empirical studies suggest that rare variants, which are much less frequent in populations and are poorly captured by single-nucleotide polymorphism chips, could play a significant role in complex diseases. Several new statistical methods have been developed for the analysis of rare variants, for example, the combined multivariate and collapsing method, the weighted-sum method and a replication-based method. Here, we apply and compare these methods to the simulated data sets of Genetic Analysis Workshop 17 and thereby explore the contribution of rare variants to disease risk. In addition, we investigate the usefulness of extreme phenotypes in identifying rare risk variants when dealing with quantitative traits. Finally, we perform a pathway analysis and show the importance of the vascular endothelial growth factor pathway in explaining different phenotypes.Genetics, Biometryrf2283, ch2526, shl5, tz33, ii2135Statistics, BiostatisticsArticlesAssociation screening for genes with multiple potentially rare variants: an inverse-probability weighted clustering approach
https://academiccommons.columbia.edu/catalog/ac:184921
Liu, Ying; Huang, Chien-Hsun; Hu, Inchi; Zheng, Tian; Lo, Shaw-Hwa10.7916/D8BP01QVThu, 29 Jun 2017 03:40:27 +0000Both common variants and rare variants are involved in the etiology of most complex diseases in humans. Developments in sequencing technology have led to the identification of a high density of rare variant single-nucleotide polymorphisms (SNPs) on the genome, each of which affects only at most 1% of the population. Genotypes derived from these SNPs allow one to study the involvement of rare variants in common human disorders. Here, we propose an association screening approach that treats genes as units of analysis. SNPs within a gene are used to create partitions of individuals, and inverse-probability weighting is used to overweight genotypic differences observed on rare variants. Association between a phenotype trait and the constructed partition is then evaluated. We consider three association tests (one-way ANOVA, chi-square test, and the partition retention method) and compare these strategies using the simulated data from the Genetic Analysis Workshop 17. Several genes that contain causal SNPs were identified by the proposed method as top genes.Genetics, Biometryyl2802, ch2526, tz33, shl5Biostatistics, StatisticsArticlesNew insights into old methods for identifying causal rare variants
https://academiccommons.columbia.edu/catalog/ac:184925
Wang, Haitian; Huang, Chien-Hsun; Zheng, Tian; Hu, Inchi; Lo, Shaw-Hwa10.7916/D8K64H03Thu, 29 Jun 2017 03:40:26 +0000The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.Genetics, Biometrych2526, tz33, shl5StatisticsArticlesIdentifying influential regions in extremely rare variants using a fixed-bin approach
https://academiccommons.columbia.edu/catalog/ac:184917
Agne, Michael; Huang, Chien-Hsun; Hu, Inchi; Wang, Haitian; Zheng, Tian; Lo, Shaw-Hwa10.7916/D8VM4B5WThu, 29 Jun 2017 03:40:24 +0000In this study, we analyze the Genetic Analysis Workshop 17 data to identify regions of single-nucleotide polymorphisms (SNPs) that exhibit a significant influence on response rate (proportion of subjects with an affirmative affected status), called the affected ratio, among rare variants. Under the null hypothesis, the distribution of rare variants is assumed to be uniform over case (affected) and control (unaffected) subjects. We attempt to pinpoint regions where the composition is significantly different between case and control events, specifically where there are unusually high numbers of rare variants among affected subjects. We focus on private variants, which require a degree of “collapsing” to combine information over several SNPs, to obtain meaningful results. Instead of implementing a gene-based approach, where regions would vary in size and sometimes be too small to achieve a strong enough signal, we implement a fixed-bin approach, with a preset number of SNPs per region, relying on the assumption that proximity and similarity go hand in hand. Through application of 100-SNP and 30-SNP fixed bins, we identify several most influential regions, which later are seen to contain some of the causal SNPs. The 100- and 30-SNP approaches detected seven and three causal SNPs among the most significant regions, respectively, with two overlapping SNPs located in the ELAVL4 gene, reported by both procedures.Genetics, Biometrymra2110, ch2526, tz33, shl5StatisticsArticlesHow many people do you know?: Efficiently estimating personal network size
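The fixed-bin idea above (preset bins of SNPs, compared between cases and controls) can be sketched in a few lines. The data, the bin size of 3, and the simple case-excess score below are hypothetical illustrations, not the workshop's actual procedure:

```python
# Toy fixed-bin scan: group SNPs into bins of a preset size and, per bin,
# compare rare-variant counts in cases vs. controls. Data are simulated;
# bin size 3 stands in for the paper's 30- and 100-SNP bins.

def fixed_bins(n_snps, bin_size):
    return [range(i, min(i + bin_size, n_snps)) for i in range(0, n_snps, bin_size)]

def bin_case_excess(case_counts, control_counts, bin_size):
    """Per-bin difference between case and control rare-variant counts."""
    n = len(case_counts)
    return [
        sum(case_counts[j] for j in b) - sum(control_counts[j] for j in b)
        for b in fixed_bins(n, bin_size)
    ]

# Rare-allele counts per SNP (hypothetical): the second bin (SNPs 3-5)
# is enriched among affected subjects.
cases    = [0, 1, 0, 4, 3, 5, 0, 1, 0]
controls = [1, 0, 1, 0, 1, 0, 1, 0, 0]
excess = bin_case_excess(cases, controls, bin_size=3)
```

In a real analysis the per-bin score would be assessed against a null distribution of rare variants uniform over cases and controls, as the abstract describes.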
https://academiccommons.columbia.edu/catalog/ac:185367
Zheng, Tian; Salganik, Matthew J.; McCormick, Tyler H.10.7916/D8FX78BTThu, 29 Jun 2017 03:40:03 +0000In this paper we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. and other previous attempts to estimate individual network size, we propose a latent non-random mixing model which resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names to be asked about are chosen properly, the simple scale-up degree estimates can enjoy the same bias reduction as those from our more complex latent non-random mixing model.Statistics, Social sciences--Researchtz33, thm2105StatisticsArticlesOn Bootstrap Tests of Symmetry About an Unknown Median
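The simple scale-up degree estimate mentioned above divides the total number of alters a respondent reports in the probe subpopulations by the total size of those subpopulations. A minimal sketch, with hypothetical subpopulation sizes:

```python
# Killworth-style scale-up estimate of personal network size (degree):
# d_hat = N * (alters known in probe subpopulations) / (total size of those
# subpopulations). The subpopulation sizes below are hypothetical.

def scale_up_degree(known_counts, subpop_sizes, population_size):
    """Estimate a respondent's degree from 'How many X's do you know?' answers."""
    return population_size * sum(known_counts) / sum(subpop_sizes)

# A respondent reports knowing 2 Michaels, 1 Nicole, 0 members of a rarer group.
counts = [2, 1, 0]
sizes = [4_500_000, 1_500_000, 100_000]   # hypothetical subpopulation sizes
degree = scale_up_degree(counts, sizes, population_size=300_000_000)
```

The latent non-random mixing model in the paper corrects the biases this raw ratio inherits when respondents' ties to the probe groups are not representative of their whole network.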
https://academiccommons.columbia.edu/catalog/ac:184965
Zheng, Tian; Gastwirth, Joseph L.10.7916/D8X9296PThu, 29 Jun 2017 03:40:03 +0000It is important to examine the symmetry of an underlying distribution before applying some statistical procedures to a data set. For example, in the Zuni School District case, a formula originally developed by the Department of Education trimmed 5% of the data symmetrically from each end. The validity of this procedure was questioned at the hearing by Chief Justice Roberts. Most tests of symmetry (even nonparametric ones) are not distribution free in finite sample sizes. Hence, using the asymptotic distribution may not yield an accurate type I error rate and may entail a loss of power in small samples. Bootstrap resampling from a symmetric empirical distribution function fitted to the data is proposed to improve the accuracy of the calculated p-value of several tests of symmetry. The results show that the bootstrap method is superior to previously used approaches relying on the asymptotic distribution of the tests, which assume the data come from a normal distribution. Incorporating the bootstrap estimate in a recently proposed test due to Miao, Gel and Gastwirth (2006) preserved its level and showed that it has reasonable power properties on the family of distributions evaluated.Statisticstz33StatisticsArticlesProtecting Minorities in Large Binary Elections: A Test of Storable Votes Using Field Data
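The bootstrap procedure described above can be sketched as follows. The asymmetry statistic here (scaled distance between mean and median) is only a stand-in for the tests discussed in the paper, and the data are illustrative:

```python
import random
import statistics

def asymmetry_stat(x):
    # Simple scaled distance between mean and median; a stand-in for
    # the actual symmetry test statistics discussed in the paper.
    return abs(statistics.fmean(x) - statistics.median(x)) / statistics.stdev(x)

def bootstrap_symmetry_pvalue(x, n_boot=2000, seed=0):
    """Bootstrap from an empirical distribution symmetrized about the median."""
    rng = random.Random(seed)
    m = statistics.median(x)
    # Reflect each observation about the median to impose symmetry under H0.
    symmetrized = list(x) + [2 * m - xi for xi in x]
    t_obs = asymmetry_stat(x)
    n = len(x)
    hits = sum(
        asymmetry_stat([rng.choice(symmetrized) for _ in range(n)]) >= t_obs
        for _ in range(n_boot)
    )
    return hits / n_boot

data = [0.1, 0.4, 0.5, 0.9, 1.3, 2.2, 3.8, 6.5]   # right-skewed sample
p = bootstrap_symmetry_pvalue(data)
```

Resampling from the symmetrized distribution, rather than relying on the asymptotic null, is what gives the calibrated p-value in small samples.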
https://academiccommons.columbia.edu/catalog/ac:182487
Casella, Alessandra M.; Gelman, Andrew E.; Ehrenberg, Shuky; Shen, Jie10.7916/D8KH0M4QThu, 29 Jun 2017 03:39:52 +0000The legitimacy of democratic systems requires the protection of minority preferences while ideally treating every voter equally. During the 2006 student elections at Columbia University, we asked voters to rank the importance of different contests and to choose where to cast a single extra "bonus vote," had one been available — a simple version of Storable Votes. We then constructed distributions of intensities and electoral outcomes and estimated the probable impact of the bonus vote through bootstrapping techniques. The bonus vote performs well: when minority preferences are particularly intense, the minority wins at least one contest with 15-30 percent probability; when the minority wins, aggregate welfare increases with 85-95 percent probability. The paper makes two contributions: it tests the performance of storable votes in a setting where preferences were not controlled, and it suggests the use of bootstrapping techniques when appropriate replications of the data cannot be obtained.Political scienceac186, ag389Economics, StatisticsArticlesDiscovering influential variables: A method of partitions
https://academiccommons.columbia.edu/catalog/ac:184953
Chernoff, Herman; Lo, Shaw-Hwa; Zheng, Tian10.7916/D8PR7TVMThu, 29 Jun 2017 03:39:38 +0000A trend in all scientific disciplines, based on advances in technology, is the increasing availability of high-dimensional data in which important information is buried. A current urgent challenge to statisticians is to develop effective methods of finding the useful information in the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer-intensive approach, based on a method pioneered by Lo and Zheng, for detecting which, of many potential explanatory variables, have an influence on a dependent variable Y. This approach is suited to detecting influential variables whose causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis, involving possibly thousands of variables, by dealing with many randomly selected small subsets from which smaller subsets are selected, guided by a measure of influence I. The main objective is to discover the influential variables, rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be amenable to appropriate analysis. In a sense, we are confining our attention to locating a few needles in a haystack.Computer science, Statisticsshl5, tz33StatisticsArticlesSelecting informative genes for discriminant analysis using multigene expression profiles.
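A toy version of such an influence measure, assuming binary explanatory variables: partition the samples by their value pattern on a candidate subset and score the between-cell variation of the response mean. This is only a sketch in the spirit of the I measure, not the authors' exact statistic:

```python
from collections import defaultdict

def influence(X, y, subset):
    """Partition samples by their value pattern on `subset`; score the
    partition by between-cell variation of the mean response. A sketch
    in the spirit of the influence measure I described above."""
    cells = defaultdict(list)
    for row, resp in zip(X, y):
        cells[tuple(row[j] for j in subset)].append(resp)
    grand = sum(y) / len(y)
    return sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in cells.values())

# Toy data: y depends on the confluence of variables 0 and 1; variable 2 is noise.
X = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1),
     (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
y = [0, 0, 0, 1, 0, 0, 0, 1]          # y = x0 AND x1
score_pair  = influence(X, y, subset=(0, 1))
score_noise = influence(X, y, subset=(2,))
```

The joint subset (0, 1) scores high while the noise variable scores zero, illustrating why scoring small subsets can reveal influence that no single variable shows marginally.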
https://academiccommons.columbia.edu/catalog/ac:184902
Yan, Xin; Zheng, Tian10.7916/D8XK8DF3Thu, 29 Jun 2017 03:39:18 +0000Gene expression data extracted from microarray experiments have been used to study the difference between mRNA abundance of genes under different conditions. In one of such experiments, thousands of genes are measured simultaneously, which provides a high-dimensional feature space for discriminating between different sample classes. However, most of these dimensions are not informative about the between-class difference, and add noises to the discriminant analysis.
In this paper we propose and study feature selection methods that evaluate the "informativeness" of a set of genes. Two measures of information based on multigene expression profiles are considered for a backward information-driven screening approach for selecting important gene features. By considering multigene expression profiles, we are able to utilize interaction information among these genes. Using a breast cancer data set, we illustrate our methods and compare their performance to that of existing methods.
We illustrate in this paper that methods considering gene-gene interactions have better classification power in gene expression analysis. In our results, we identify important genes with relatively large p-values from single-gene tests. This indicates that these are genes with weak marginal information but strong interaction information, which would be overlooked by strategies that only examine individual genes.Biometry, Geneticstz33StatisticsArticlesComment: Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies
https://academiccommons.columbia.edu/catalog/ac:184983
Zheng, Tian; Lo, Shaw-Hwa10.7916/D84T6H8MThu, 29 Jun 2017 03:39:16 +0000The authors suggest an interesting way to measure
the fraction of missing information in the context of
hypothesis testing. The measure seeks to quantify the
impact of missing observations on the test between two
hypotheses. The amount of impact can be useful information
for applied research. An example is, in genetics,
where multiple tests of the same sort are performed
on different variables with different missing rates, and
follow-up studies may be designed to resolve missing
values in selected variables.
In this discussion, we offer our prospective views on
the use of relative information in a follow-up study.
For studies where the impact of missing observations
varies greatly across different variables and where the
investigators have the flexibility of designing studies
that can have different efforts on variables, an optimal
design may be derived using relative information measures
to improve the cost-effectiveness of the followup.Statisticstz33, shl5StatisticsArticlesGenome-wide gene-based analysis of rheumatoid arthritis-associated interaction with PTPN22 and HLA-DRB1
https://academiccommons.columbia.edu/catalog/ac:184932
Qiao, Bo; Huang, Chien-Hsun; Cong, Lei; Xie, Jun; Lo, Shaw-Hwa; Zheng, Tian10.7916/D8SQ8Z92Thu, 29 Jun 2017 03:39:14 +0000The genes PTPN22 and HLA-DRB1 have been found by a number of studies to confer an increased risk for rheumatoid arthritis (RA), which indicates that both genes play an important role in RA etiology. It is believed that they not only have strong association with RA individually, but also interact with other related genes that have not been found to have predisposing RA mutations. In this paper, we conduct genome-wide searches for RA-associated gene-gene interactions that involve PTPN22 or HLA-DRB1 using the Genetic Analysis Workshop 16 Problem 1 data from the North American Rheumatoid Arthritis Consortium. MGC13017, HSPCAL3, MIA, PTPNS1L, and IGLVI-70, which showed association with RA in previous studies, have been confirmed in our analysis.Genetics, Biometrych2526, shl5, tz33StatisticsArticlesRheumatoid arthritis-associated gene-gene interaction network for rheumatoid arthritis candidate genes
https://academiccommons.columbia.edu/catalog/ac:184935
Huang, Chien-Hsun; Cong, Lei; Xie, Jun; Qiao, Bo; Lo, Shaw-Hwa; Zheng, Tian10.7916/D8J67FTVThu, 29 Jun 2017 03:39:12 +0000Rheumatoid arthritis (RA, MIM 180300) is a chronic and complex autoimmune disease. Using the North American Rheumatoid Arthritis Consortium (NARAC) data set provided in Genetic Analysis Workshop 16 (GAW16), we used the genotype-trait distortion (GTD) scores and proposed analysis procedures to capture the gene-gene interaction effects of multiple susceptibility gene regions on RA. In this paper, we focused on 27 RA candidate gene regions (531 SNPs) based on a literature search. Statistical significance was evaluated using 1000 permutations. HLADRB1 was found to have strong marginal association with RA. We identified 14 significant interactions (p < 0.01), which were aggregated into an association network among 12 selected candidate genes PADI4, FCGR3, TNFRSF1B, ITGAV, BTLA, SLC22A4, IL3, VEGF, TNF, NFKBIL1, TRAF1-C5, and MIF. Based on our and other contributors' findings during the GAW16 conference, we further studied 24 candidate regions with 336 SNPs. We found 23 significant interactions (p-value < 0.01), nine interactions in addition to our initial findings, and the association network was extended to include candidate genes HLA-A, HLA-B, HLA-C, CTLA4, and IL6. As we will discuss in this paper, the reported possible interactions between genes may suggest potential biological activities of RA.Genetics, Biometrych2526, shl5, tz33StatisticsArticlesGenetic-linkage mapping of complex hereditary disorders to a whole-genome molecular-interaction network
https://academiccommons.columbia.edu/catalog/ac:184959
Iossifov, Ivan; Zheng, Tian; Baron, Miron; Gilliam, T. Conrad; Rzhetsky, Andrey10.7916/D85T3JD0Thu, 29 Jun 2017 03:38:55 +0000Common hereditary neurodevelopmental disorders such as autism, bipolar disorder, and schizophrenia are most likely both genetically multifactorial and heterogeneous. Because of these characteristics traditional methods for genetic analysis fail when applied to such diseases. To address the problem we propose a novel probabilistic framework that combines the standard genetic linkage formalism with whole-genome molecular-interaction data to predict pathways or networks of interacting genes that contribute to common heritable disorders. We apply the model to three large genotype–phenotype data sets, identify a small number of significant candidate genes for autism (24), bipolar disorder (21), and schizophrenia (25), and predict a number of gene targets likely to be shared among the disorders.Biometry, Geneticstz33, tcg1, ar345StatisticsArticlesPattern-based mining strategy to detect multi-locus association and gene × environment interaction
https://academiccommons.columbia.edu/catalog/ac:184941
Li, Zhong; Zheng, Tian; Califano, Andrea; Floratos, Aristidis10.7916/D8H70DQGThu, 29 Jun 2017 03:38:55 +0000As genome-wide association studies grow in popularity for the identification of genetic factors for common and rare diseases, analytical methods to comb through large numbers of genetic variants efficiently to identify disease association are increasingly in demand. We have developed a pattern-based data-mining approach to discover unlinked multilocus genetic effects for complex disease and to detect genotype × phenotype/genotype × environment interactions. On a densely mapped chromosome 18 data set for rheumatoid arthritis that was made available by Genetic Analysis Workshop 15, this method detected two potential two-locus associations as well as a putative two-locus gene × gender interaction.Genetics, Biometryzl2147, tz33, ac2248, af2202Systems Biology, Statistics, Biomedical InformaticsArticlesDiscovering interactions among BRCA1 and other candidate genes associated with sporadic breast cancer
https://academiccommons.columbia.edu/catalog/ac:184992
Lo, Shaw-Hwa; Chernoff, Herman; Cong, Lei; Ding, Yuejing; Zheng, Tian10.7916/D8CC0ZKFThu, 29 Jun 2017 03:38:54 +0000Analysis of a subset of case-control sporadic breast cancer data [from the National Cancer Institute's Cancer Genetic Markers of Susceptibility (CGEMS) initiative], focusing on 18 breast cancer-related genes with 304 SNPs, indicates that there are many interesting interactions that form two- and three-way networks in which BRCA1 plays a dominant and central role. The apparent interactions of BRCA1 with many other genes suggest the conjecture that BRCA1 serves as a protective gene and that some mutations in it or in related genes may prevent it from carrying out this protective function even if the patients are not carriers of known cancer-predisposing BRCA1 mutations. The method of analysis features the evaluation of the effect of a gene by averaging the effects of the SNPs covered by that gene. Marginal methods that test one gene at a time fail to show any effect. That may be related to the fact that each of these 18 genes adds very little to the risk of cancer. Analysis that relates the ratio of interactions to the maximum of the first-order effects discovers significant gene pairs and triplets.
Breast cancer (MIM 114480) has complex causes. Known predisposition genes explain <15% of the breast cancer cases. It is generally believed that most sporadic breast cancers are triggered by unknown combined effects, possibly because of a large number of genes and other risk factors, each adding a small risk toward cancer etiology. Progress in seeking breast cancer genes other than BRCA1 and BRCA2 has been slow and limited because the individual risk due to each gene is small. This difficulty may be partly due to the fact that current methods rely largely on marginal information from genes studied one at a time and ignore potentially valuable information because of the interaction among multiple loci. Because each responsible gene may have a small marginal effect in causing disease, it is likely that such methods will fail to capture many responsible genes by studying a dataset where the disease may be due to a variety of different sources. The possible presence of many genes responsible for different subgroups of cancer patients may reduce the power of current methods to detect genes partly responsible for some forms of breast cancer. It is believed that methods effective in extracting interactive information from data should be developed.
What should be done when marginal effects are too weak to be detected? Our methods use interactive information from multiple sites as well as marginal information, and they provide power to detect interacting genes. To test this claim and to demonstrate the practical value of these methods in real applications, we apply them to an important study: a subset of a large dataset collected from a case-control sporadic breast cancer study, focusing on gene–gene-based analysis. This partial dataset comprises 18 genes with 304 SNP markers. The application results in a number of scientific findings.
The message of this article is fourfold. First, if marginal methods fail, more powerful methods that take into account interactive information can be used effectively. We apply our proposed methods to this dataset to illustrate the detection of the interactions between genes. We point out that in our findings, none of the 18 selected genes show any detectable marginal effects that are significantly higher than those generated by random fluctuations. In other words, all of the 18 genes would be missed if only marginal methods were used.
Second, we demonstrate how to carry out a gene-based analysis by treating each gene as a basic unit while incorporating relevant information from all SNPs within that gene. Two summary test scores are proposed to quantify the strength of interactions for each pair of genes. The pairwise interactions can be extended easily. We also provide results using third-order interactions.
Third, to establish statistical significance, we generate a large number of permutations of the dependent variable (case or control) to see how the measures of interaction for the real data compare with those from the many permutations.
Finally, when these procedures are applied to the data, they lead to a number of interesting findings. It is shown that there are a substantial number of significant interactions that form a network in which BRCA1 plays a dominant role. The interactions of BRCA1 with many of the other genes suggest the conjecture that BRCA1 serves as a protective gene and that some mutations in it or in related genes may prevent it from carrying out the protective function.Biometry, Geneticsshl5, tz33StatisticsArticlesIdentification of gene interactions associated with disease from gene expression data using synergy networks
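The permutation step described above, shuffling the case/control labels and recomputing the measure of interaction, is generic. A minimal sketch with a toy mean-difference statistic standing in for the gene-based interaction scores:

```python
import random

def permutation_pvalue(statistic, x, labels, n_perm=1000, seed=0):
    """Permutation p-value: shuffle case/control labels, recompute the statistic."""
    rng = random.Random(seed)
    t_obs = statistic(x, labels)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if statistic(x, perm) >= t_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one to avoid a zero p-value

# Toy statistic: absolute difference of group means, cases (1) vs. controls (0).
def mean_diff(x, labels):
    g1 = [v for v, l in zip(x, labels) if l == 1]
    g0 = [v for v, l in zip(x, labels) if l == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

x = [5.1, 4.9, 5.3, 1.2, 1.0, 0.8]
labels = [1, 1, 1, 0, 0, 0]
p = permutation_pvalue(mean_diff, x, labels)
```

Because shuffling preserves the case/control counts, the permuted statistics trace out the null distribution under no association, exactly the comparison the paragraph describes.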
https://academiccommons.columbia.edu/catalog/ac:184938
Watkinson, John; Wang, Xiaodong; Zheng, Tian; Anastassiou, Dimitris10.7916/D81835DPThu, 29 Jun 2017 03:38:53 +0000Analysis of microarray data has been used for the inference of gene-gene interactions. If, however, the aim is the discovery of disease-related biological mechanisms, then the criterion for defining such interactions must be specifically linked to disease.
Here we present a computational methodology that jointly analyzes two sets of microarray data, one in the presence and one in the absence of a disease, identifying gene pairs whose correlation with disease is due to cooperative, rather than independent, contributions of genes, using the recently developed information-theoretic measure of synergy. High levels of synergy in gene pairs indicate possible membership of the two genes in a shared pathway and lead to a graphical representation of inferred gene-gene interactions associated with disease, in the form of a "synergy network." We apply this technique to a set of publicly available prostate cancer expression data and successfully validate our results, confirming that they cannot be due to pure chance and providing a biological explanation for gene pairs with exceptionally high synergy.
Thus, synergy networks provide a computational methodology helpful for deriving "disease interactomes" from biological data. When coupled with additional biological knowledge, they can also be helpful for deciphering biological mechanisms responsible for disease.Genetics, Biometryxw2008, tz33, da8Electrical Engineering, StatisticsArticlesConstructing gene association networks for rheumatoid arthritis using the backward genotype-trait association (BGTA) algorithm
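The synergy of a gene pair with respect to a disease class C is the information the pair carries jointly beyond what the genes carry separately, I(X1,X2;C) - I(X1;C) - I(X2;C). A small sketch for discrete data, using an XOR-like example in which all of the information is cooperative:

```python
from math import log2
from collections import Counter

def mutual_info(xs, cs):
    """I(X; C) in bits for paired discrete samples."""
    n = len(xs)
    px, pc, pxc = Counter(xs), Counter(cs), Counter(zip(xs, cs))
    return sum(
        (k / n) * log2((k / n) / ((px[x] / n) * (pc[c] / n)))
        for (x, c), k in pxc.items()
    )

def synergy(x1, x2, c):
    """I(X1,X2; C) - I(X1; C) - I(X2; C): cooperative information about c."""
    joint = list(zip(x1, x2))
    return mutual_info(joint, c) - mutual_info(x1, c) - mutual_info(x2, c)

# Neither variable alone predicts the class, but together they determine it
# completely, giving synergy of one full bit.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
c  = [0, 1, 1, 0]   # c = x1 XOR x2
s = synergy(x1, x2, c)
```

With real expression data the variables would be discretized expression levels and the edges of a synergy network would connect pairs with exceptionally high s.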
https://academiccommons.columbia.edu/catalog/ac:184950
Ding, Yuejing; Cong, Lei; Ionita-Laza, Iuliana; Lo, Shaw-Hwa; Zheng, Tian10.7916/D8Z89B92Thu, 29 Jun 2017 03:38:38 +0000Rheumatoid arthritis (RA, MIM 180300) is a common and complex inflammatory disorder. The North American Rheumatoid Arthritis Consortium (NARAC) data, as part of the Genetic Analysis Workshop 15 data, consists of both genome scan and candidate gene studies on RA patients.
We applied the backward genotype-trait association (BGTA) algorithm to capture marginal and gene × gene interaction effects of multiple susceptibility loci on RA disease status. A two-stage screening approach was used for the genome scan, whereas a comprehensive study of all possible subsets was conducted for the candidate genes. For the genome scan, we constructed an association network among 39 genetic loci that demonstrated strong signals, 19 of which have been reported in the RA literature. For the candidate genes, we found strong signals for PTPN22 and SUMO4. Based on significant association evidence, we built an association network among the loci of PTPN22, PADI4, DLG5, SLC22A4, SUMO4, and CARD15. To control for false positives, we used permutation tests to constrain the family-wise type I error rate to 1%.
Using the BGTA algorithm, we identified genetic loci and candidate genes that were associated with RA susceptibility and association networks among them. For the first time, we report possible interactions between single-nucleotide polymorphisms/genes, which may be useful for biological interpretation.Genetics, Biometryii2135, shl5, tz33Statistics, BiostatisticsArticlesTranscription activity hot spot, is it real or an artifact?
https://academiccommons.columbia.edu/catalog/ac:184944
Wang, Shuang; Zheng, Tian; Wang, Yuanjia10.7916/D808647VThu, 29 Jun 2017 03:38:38 +0000Transcription activity 'hot spots', defined as chromosome regions that contain more expression quantitative trait loci than would have been expected by chance, have been frequently detected both in humans and in model organisms. It has been common to consider the existence of hot spots as evidence for master regulation of gene expression. However, hot spots could also simply be due to highly correlated gene expressions or linkage disequilibrium and do not truly represent master regulators. A recent simulation study using real human gene expression data but simulated random single-nucleotide polymorphism genotypes showed patterns of clustering of expression quantitative trait loci that resemble those in actual studies [Perez-Enciso: Genetics 2004, 166: 547–554.]. In this study, to assess the credibility of transcription activity hot spots, we conducted genetic analyses on gene expressions provided by Genetic Analysis Workshop 15 Problem 1.Genetics, Biometrysw2206, tz33, yw2016Biostatistics, StatisticsArticlesJoint study of genetic regulators for expression traits related to breast cancer
https://academiccommons.columbia.edu/catalog/ac:184947
Zheng, Tian; Wang, Shuang; Cong, Lei; Ding, Yuejing; Ionita-Laza, Iuliana; Lo, Shaw-Hwa. DOI: 10.7916/D86T0KHX. Thu, 29 Jun 2017 03:38:30 +0000.
The mRNA expression levels of genes have been shown to have discriminating power for the classification of breast cancer. Studying the heritability of gene expression levels on breast cancer related transcripts can lead to the identification of shared common regulators and inter-regulation patterns, which would be important for dissecting the etiology of breast cancer.
We applied multilocus association genome-wide scans to 18 breast cancer related transcripts and combined the results with traditional linkage scans. Regulatory hotspots for these transcripts were identified and some inter-regulation patterns were observed. We also derived evidence on interacting genetic regulatory loci shared by a number of these transcripts.
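As a rough illustration of what evaluating "marginal and interaction association signals" at each SNP can mean, the sketch below computes Pearson chi-square statistics for single-locus genotype-by-trait tables and for joint two-locus genotype tables. The function names and the toy data are invented for this example; the multilocus method used in the paper itself is more involved.

```python
import numpy as np
from itertools import combinations

def chi2_stat(table):
    """Pearson chi-square statistic for a contingency table."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    mask = expected > 0          # skip empty cells
    return ((table - expected)[mask] ** 2 / expected[mask]).sum()

def marginal_and_pairwise(genotypes, trait):
    """Chi-square association of each locus (marginal, 3x2 table) and
    each locus pair (joint two-locus genotype, 9x2 table) with a
    binary trait."""
    n, p = genotypes.shape
    marginal = {}
    for j in range(p):
        table = np.zeros((3, 2))
        for g, t in zip(genotypes[:, j], trait):
            table[g, t] += 1
        marginal[j] = chi2_stat(table)
    pairwise = {}
    for j, k in combinations(range(p), 2):
        table = np.zeros((9, 2))
        for gj, gk, t in zip(genotypes[:, j], genotypes[:, k], trait):
            table[3 * gj + gk, t] += 1
        pairwise[(j, k)] = chi2_stat(table)
    return marginal, pairwise

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(300, 5))   # 300 subjects, 5 loci, coded 0/1/2
y = rng.integers(0, 2, size=300)        # binary trait
marg, pair = marginal_and_pairwise(G, y)
```

A SNP pair whose joint-table statistic is large while both marginal statistics are small is the kind of interaction signal a marginal-only scan would miss.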
In this paper, by restricting to a set of related genes, we were able to employ a more detailed multilocus approach that evaluates both marginal and interaction association signals at each single-nucleotide polymorphism. Interesting inter-regulation patterns and significant overlaps of genetic regulators between transcripts were observed. Interaction association results returned more significant expression quantitative trait locus hotspots.
Subjects: Genetics, Biometry. UNIs: tz33, sw2206, ii2135, shl5. Departments: Statistics, Biostatistics. Genre: Articles.

How Many People Do You Know in Prison? Using Overdispersion in Count Data to Estimate Social Structure in Networks
https://academiccommons.columbia.edu/catalog/ac:185364
Zheng, Tian; Salganik, Matthew J.; Gelman, Andrew E. DOI: 10.7916/D800011W. Thu, 29 Jun 2017 03:38:21 +0000.
Networks (sets of objects connected by relationships) are important in a number of fields. The study of networks has long been central to sociology, where researchers have attempted to understand the causes and consequences of the structure of relationships in large groups of people. Using insight from previous network research, Killworth et al. and McCarty et al. have developed and evaluated a method for estimating the sizes of hard-to-count populations using network data collected from a simple random sample of Americans. In this article we show how, using a multilevel overdispersed Poisson regression model, these data can also be used to estimate aspects of social structure in the population. Our work goes beyond most previous research on networks by using variation, as well as average responses, as a source of information. We apply our method to the data of McCarty et al. and find that Americans vary greatly in their number of acquaintances. Further, Americans show great variation in propensity to form ties to people in some groups (e.g., males in prison, the homeless, and American Indians), but little variation for other groups (e.g., twins, people named Michael or Nicole). We also explore other features of these data and consider ways in which survey data can be used to estimate network structure.
Subjects: Statistics, Social sciences--Research. UNIs: tz33, ag389. Departments: Statistics, Political Science. Genre: Articles.

Probing genetic overlap among complex human phenotypes
https://academiccommons.columbia.edu/catalog/ac:184989
Rzhetsky, Andrey; Wajngurt, David; Park, Naeun; Zheng, Tian. DOI: 10.7916/D8MS3RPR. Thu, 29 Jun 2017 03:38:21 +0000.
Geneticists and epidemiologists often observe that certain hereditary disorders co-occur in individual patients significantly more (or significantly less) frequently than expected, suggesting there is a genetic variation that predisposes its bearer to multiple disorders, or that protects against some disorders while predisposing to others. We suggest that, by using a large number of phenotypic observations about multiple disorders and an appropriate statistical model, we can infer genetic overlaps between phenotypes. Our proof-of-concept analysis of 1.5 million patient records and 161 disorders indicates that disease phenotypes form a highly connected network of strong pairwise correlations. Our modeling approach, under appropriate assumptions, allows us to estimate from these correlations the size of putative genetic overlaps. For example, we suggest that autism, bipolar disorder, and schizophrenia share significant genetic overlaps. Our disease network hypothesis can be immediately exploited in the design of genetic mapping approaches that involve joint linkage or association analyses of multiple seemingly disparate phenotypes.
Subjects: Biometry, Genetics. UNIs: ar345, tz33. Departments: Statistics. Genre: Articles.

Backward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs
https://academiccommons.columbia.edu/catalog/ac:185325
Zheng, Tian; Wang, Hui; Lo, Shaw-Hwa. DOI: 10.7916/D8SF2V33. Thu, 29 Jun 2017 03:38:20 +0000.
Background: Studies of complex traits pose new challenges to current methods that evaluate association between genotypes and a specific trait. Consideration of possible interactions among loci leads to overwhelming dimensions that cannot be handled using current statistical methods. Methods: In this article, we evaluate a multi-marker screening algorithm, the backward genotype-trait association (BGTA) algorithm, for case-control designs, which uses unphased multi-locus genotypes. BGTA carries out a global investigation of a candidate marker set and automatically screens out markers carrying diminutive amounts of information regarding the trait in question. To address the "too many possible genotypes, too few informative chromosomes" dilemma of a genomic-scale study that consists of hundreds to thousands of markers, we further investigate a BGTA-based marker selection procedure, in which the screening algorithm is repeated on a large number of random marker subsets. Results of these screenings are then aggregated into counts of how often each marker is retained by the BGTA algorithm. Markers with exceptionally high return counts are selected for further analysis. Results and Conclusion: Evaluated using simulations under several disease models, the proposed methods prove to be more powerful in dealing with epistatic traits. We also demonstrate the proposed methods through an application to a study on inflammatory bowel disease.
Subjects: Statistics, Genetics, Biometry. UNIs: tz33, hw2334, shl5. Departments: Statistics, Microbiology and Immunology, Biostatistics. Genre: Articles.

A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data
https://academiccommons.columbia.edu/catalog/ac:184986
Lo, Shaw-Hwa; Zheng, Tian. DOI: 10.7916/D8W95829. Thu, 29 Jun 2017 03:37:35 +0000.
We test the backward haplotype transmission association algorithm on genome-scan data previously studied by Rioux et al. [Rioux, J. D., et al. (2000) Am. J. Hum. Genet. 66, 1863–1870]. In their study, multipoint linkage methods were applied to affected sib-pairs with inflammatory bowel disease, and significant linkage evidence pointed to two susceptibility loci. After we apply our approach to these data with a global search accounting for both joint and marginal effects, several intriguing results emerge. These results provide compelling support for the application of our approach to other data wherever applicable. Results from this project also make it clear that it is important to reinvestigate available family-based datasets that can be suitably reanalyzed. Given previously collected data in the literature, our approach, with its increased efficiency in using available resources, draws additional crucial information that may lead to novel and surprising results.
Subjects: Biometry, Genetics. UNIs: shl5, tz33. Departments: Statistics. Genre: Articles.

Backward Haplotype Transmission Association (BHTA) Algorithm: A Fast Multiple-Marker Screening Method
https://academiccommons.columbia.edu/catalog/ac:185361
Lo, Shaw-Hwa; Zheng, Tian. DOI: 10.7916/D87D2T2X. Thu, 29 Jun 2017 03:37:24 +0000.
The mapping of complex traits is one of the most important and central areas of human genetics today. Recent attention has been focused on genome scans using a large number of marker loci. Because complex traits are typically caused by multiple genes, the common approaches of mapping them by testing markers one after another fail to capture the substantial information of interactions among disease loci. Here we propose a backward haplotype transmission association (BHTA) algorithm to address this problem. The algorithm can carry out a screening under any disease model when case-parent trio data are available. It identifies the important subset of an original larger marker set by eliminating the markers of least significance, one at a time, after a complete evaluation of each marker's importance. In contrast with existing methods, three major advantages emerge from this approach. First, it can be applied flexibly to arbitrary markers, regardless of their locations. Second, because it takes into account haplotype information, it is more powerful in detecting multifactorial traits in the presence of haplotypic association. Finally, the proposed method can potentially prove more efficient in future genome-wide scans, in terms of greater accuracy of gene detection and a substantially reduced number of tests required. We illustrate the performance of the algorithm with several examples, including one real data set with 31 markers from a study on Gilles de la Tourette syndrome. Detailed theoretical justifications are also included, explaining why the algorithm is likely to select the 'correct' markers.
Subjects: Biometry, Genetics. UNIs: shl5, tz33. Departments: Statistics, Biostatistics. Genre: Articles.

Correction: A dual clustering framework for association screening with whole genome sequencing data and longitudinal traits
https://academiccommons.columbia.edu/catalog/ac:200852
Liu, Ying; Huang, Chien-Hsun; Hu, Inchi; Lo, Shaw-Hwa; Zheng, Tian. DOI: 10.7916/D8CV4G8Z. Wed, 28 Jun 2017 21:10:23 +0000.
Correction: In the previous publication of our article [1], Figure 1 was incorrectly processed as grayscale. We present here, in this correction, the original figure in full color. Figure 1: Clustering of individuals using SNPs with MAFs between 0.01 and 0.05 for MAP4. A, Shown are 10 clusters, with the numbers at the top giving odds ratios within each partition block based on blood pressures. Each row is a SNP, and each column is an individual. SNPs are ordered with decreasing MAFs (from top to bottom). Green vertical bars indicate subjects with higher blood pressures (see text). Genotype aa is plotted in red, aA is plotted in blue, and AA is plotted in white (a denotes the minor allele). The partitions of the 849 individuals are indicated by dotted lines. Most partition elements are driven by similarity on rarer SNPs but not on more common SNPs. B, Clustering of individuals using their SBP curves from the first simulation. It can be seen that individuals are reasonably grouped into one high blood pressure cluster and one low blood pressure cluster.
Subjects: Genomics, Biometry. UNIs: yl2802, ch2526, shl5, tz33. Departments: Biostatistics, Statistics. Genre: Articles.

New insights into old methods for identifying causal rare variants
https://academiccommons.columbia.edu/catalog/ac:195277
Hu, Inchi; Zheng, Tian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Wang, Haitian. DOI: 10.7916/D8J38R1M. Wed, 28 Jun 2017 21:04:19 +0000.
The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.
Subjects: Human genetics--Variation, Biometry--Statistical methods, Statistics--Methodology, Biometry, Statistics. UNIs: tz33, ch2526, shl5. Departments: Statistics. Genre: Articles.

A note on QTL detecting for censored traits
https://academiccommons.columbia.edu/catalog/ac:192015
Fang, Yixin. DOI: 10.7916/D8N58JVH. Wed, 28 Jun 2017 21:00:57 +0000.
Most existing statistical methods for mapping quantitative trait loci (QTL) assume that the phenotype follows a normal distribution and that it is fully observed. However, some phenotypes have skewed distributions and may be censored. This note proposes a simple and efficient approach to QTL detection for censored traits with the Cox proportional hazards model, without estimating the baseline hazard function, which is a nuisance parameter.
Subjects: Censored observations (Statistics), Genetics--Mathematical models, Genetics, Biometry. UNIs: yf2113. Departments: Statistics. Genre: Articles.

BAMarray™: Java software for Bayesian analysis of variance for microarray data
https://academiccommons.columbia.edu/catalog/ac:192099
Ishwaran, Hemant; Rao, J. Sunil; Kogalur, Udaya B. DOI: 10.7916/D8BR8QNZ. Wed, 28 Jun 2017 21:00:37 +0000.
Background: DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important example is multigroup data collected over different experimental groups, such as data collected from distinct stages of a disease process. We have developed a method specifically addressing these issues termed Bayesian ANOVA for microarrays (BAM). The BAM approach uses a special inferential regularization known as spike-and-slab shrinkage that provides an optimal balance between total false detections and total false non-detections. This translates into more reproducible differential calls. Spike-and-slab shrinkage is a form of regularization achieved by using information across all genes and groups simultaneously.
Results: BAMarray™ is a graphically oriented Java-based software package that implements the BAM method for detecting differentially expressed genes in multigroup microarray experiments (up to 256 experimental groups can be analyzed). Drop-down menus allow the user to easily select between different models and to choose various run options. BAMarray™ can also be operated in a fully automated mode with preselected run options. Tuning parameters have been preset at theoretically optimal values freeing the user from such specifications. BAMarray™ provides estimates for gene differential effects and automatically estimates data adaptive, optimal cutoff values for classifying genes into biological patterns of differential activity across experimental groups. A graphical suite is a core feature of the product and includes diagnostic plots for assessing model assumptions and interactive plots that enable tracking of prespecified gene lists to study such things as biological pathway perturbations. The user can zoom in and lasso genes of interest that can then be saved for downstream analyses.
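The spike-and-slab shrinkage idea behind BAM can be sketched with a toy two-component calculation on gene-level z-scores: each gene's effect is modeled as coming either from a near-zero "spike" or a wide "slab", and the posterior mixes strong shrinkage toward 0 with mild shrinkage for apparent signals. All parameter values and function names below are invented for illustration; BAM's actual prior and estimation procedure differ.

```python
import math

def slab_posterior_weight(z, p_slab=0.1, tau_spike=0.05, tau_slab=2.0):
    """Posterior probability that a gene's effect comes from the 'slab'
    (real signal) rather than the near-zero 'spike', given a z-score
    z ~ N(theta, 1) with theta drawn from the two-component mixture."""
    def norm_pdf(x, var):
        return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)
    m_spike = norm_pdf(z, 1 + tau_spike ** 2)   # marginal density under spike
    m_slab = norm_pdf(z, 1 + tau_slab ** 2)     # marginal density under slab
    num = p_slab * m_slab
    return num / (num + (1 - p_slab) * m_spike)

def shrunken_effect(z, p_slab=0.1, tau_spike=0.05, tau_slab=2.0):
    """Model-averaged posterior mean of the effect: near-total shrinkage
    for spike genes, mild shrinkage for slab genes."""
    w = slab_posterior_weight(z, p_slab, tau_spike, tau_slab)
    shrink = lambda tau: tau ** 2 / (1 + tau ** 2)   # conjugate-normal factor
    return w * shrink(tau_slab) * z + (1 - w) * shrink(tau_spike) * z
```

Small z-scores are pulled almost exactly to zero while large ones survive nearly intact, which is the balance between false detections and false non-detections described above.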
Conclusion: BAMarray™ is user-friendly, platform-independent software that effectively and efficiently implements the BAM methodology. Classifying patterns of differential activity is greatly facilitated by a data adaptive cutoff rule and a graphical suite. BAMarray™ is licensed software freely available to academic institutions. More information can be found at http://www.bamarray.com.
Subjects: DNA microarrays--Data processing, Java (Computer program language), Bioinformatics, Bayesian statistical decision theory, Statistics, Information technology. UNIs: ubk2101. Departments: Statistics. Genre: Articles.

Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals
https://academiccommons.columbia.edu/catalog/ac:174140
Stodden, Victoria C.; Guo, Peixuan; Ma, Zhaokun. DOI: 10.7916/D80K26NN. Wed, 28 Jun 2017 20:28:25 +0000.
Journal policy on research data and code availability is an important part of the ongoing shift toward publishing reproducible computational science. This article extends the literature by studying journal data sharing policies by year (for both 2011 and 2012) for a referent set of 170 journals. We make a further contribution by evaluating code sharing policies, supplemental materials policies, and open access status for these 170 journals for each of 2011 and 2012. We build a predictive model of open data and code policy adoption as a function of impact factor and publisher, and find that higher-impact journals, and journals published by scientific societies rather than commercial publishers, are more likely to have open data and code policies. We also find that open data policies tend to lead open code policies, and we find no relationship between open data and code policies and either supplemental materials policies or open access journal status. Of the journals in this study, 38% had a data policy, 22% had a code policy, and 66% had a supplemental materials policy as of June 2012. This reflects a striking one-year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies. We introduce a new dataset to the community that categorizes data and code sharing, supplemental materials, and open access policies in 2011 and 2012 for these 170 journals.
Subjects: Communication of technical information, Information science. UNIs: vcs2115, zm2168. Departments: Statistics. Genre: Articles.

Medication-Wide Association Studies
https://academiccommons.columbia.edu/catalog/ac:173912
Ryan, P. B.; Madigan, David B.; Stang, P. E.; Schuemie, M. J.; Hripcsak, George M. DOI: 10.7916/D8PG1PVX. Wed, 28 Jun 2017 20:28:18 +0000.
Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges of confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects.
Subjects: Pharmacology, Statistics, Bioinformatics. UNIs: dm2418, gh13. Departments: Statistics, Biomedical Informatics. Genre: Articles.

Learning Theory Analysis for Association Rules and Sequential Event Prediction
https://academiccommons.columbia.edu/catalog/ac:173905
Rudin, Cynthia; Letham, Benjamin; Madigan, David B. DOI: 10.7916/D82N50C1. Wed, 28 Jun 2017 20:28:13 +0000.
We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer's online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support.
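To make the rule-mining machinery concrete, here is a toy miner of single-item rules over past shopping-cart sequences. The K-shrunken score in it is a generic Laplace-style stand-in for a confidence measure that relaxes a hard minimum-support cutoff; the names, the data, and the constant K are invented for this sketch, and the paper's "adjusted confidence" is defined formally there.

```python
from collections import Counter
from itertools import combinations

def mine_rules(sequences, k_adjust=5.0, min_support=2):
    """Score single-item rules a -> b from past event sequences.
    Confidence = count(a and b) / count(a); the shrunken score
    count(a and b) / (count(a) + K) penalizes rules whose antecedent
    has rarely been observed, softening a hard support threshold."""
    item_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        items = set(seq)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), n_ab in pair_counts.items():
        for lhs, rhs in ((a, b), (b, a)):
            if item_counts[lhs] >= min_support:
                conf = n_ab / item_counts[lhs]           # plain confidence
                adj = n_ab / (item_counts[lhs] + k_adjust)  # shrunken score
                rules.append((lhs, rhs, conf, adj))
    return sorted(rules, key=lambda r: -r[3])

carts = [["milk", "bread"], ["milk", "bread", "eggs"],
         ["bread", "eggs"], ["milk", "eggs"], ["milk", "bread"]]
rules = mine_rules(carts)
```

Ranking by the shrunken score rather than raw confidence keeps rules backed by very few observations from dominating the top of the list, which is the "cold start" concern discussed above.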
The paper brings together ideas from statistical learning theory, association rule mining, and Bayesian analysis.
Subjects: Statistics, Artificial intelligence. UNIs: dm2418. Departments: Statistics. Genre: Articles.

Generating Productive Dialogue between Consulting Statisticians and their Clients in the Pharmaceutical and Medical Research Settings
https://academiccommons.columbia.edu/catalog/ac:173832
Emir, Birol; Amaratunga, Dhammika; Beltangady, Mohan; Cabrera, Javier; Freeman, Roy; Madigan, David B.; Nguyen, Ha H.; Whalen, Edward Patrick. DOI: 10.7916/D8PK0D8N. Wed, 28 Jun 2017 20:27:47 +0000.
Due to the ever-increasing complexity of scientific technologies and resulting data, consulting statisticians are becoming more involved in the design, conduct, and analysis of biomedical research. This requires extensive collaboration between the consulting statistician and nonstatisticians, such as researchers, clinicians, and corporate executives. Consequently, a successful consulting career is becoming ever more dependent on the statistician's ability to effectively communicate with nonstatisticians. This is especially true when more complex, nontraditional analytical methods are required. In this paper, we examine the collaboration between statisticians and nonstatisticians from three different professional perspectives. Integrating these perspectives, we discuss ways to help the consulting statistician generate productive dialogue with clients. Finally, we examine how universities can better prepare students for careers in statistical consulting by incorporating more communication-based elements into their curriculum and by offering students ample opportunities to collaborate with nonstatisticians. Overall, we designed this exercise to help the consulting statistician generate dialogue with clients that results in more productive collaborations and a more satisfying work experience.
Subjects: Statistics, Bioinformatics, Medicine. UNIs: be2166, dm2418, hhn2108, ew2320. Departments: Statistics. Genre: Articles.

Bayesian Hierarchical Rule Modeling for Predicting Medical Conditions
https://academiccommons.columbia.edu/catalog/ac:173882
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B. DOI: 10.7916/D8V69GP1. Wed, 28 Jun 2017 20:26:51 +0000.
We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future medical conditions given the patient’s current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “condition 1 and condition 2 → condition 3”) from a large set of candidate rules. Because this method “borrows strength” using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of conditions is available.
Subjects: Mathematics, Statistics, Medicine. UNIs: thm2105, dm2418. Departments: Statistics. Genre: Articles.

A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction
https://academiccommons.columbia.edu/catalog/ac:173838
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B. DOI: 10.7916/D89C6VJD. Wed, 28 Jun 2017 20:25:55 +0000.
In many healthcare settings, patients visit healthcare professionals periodically and report multiple medical conditions, or symptoms, at each encounter. We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future symptoms given the patient’s current and past history of reported symptoms. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “symptom 1 and symptom 2 → symptom 3”) from a large set of candidate rules. Because this method “borrows strength” using the symptoms of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of symptoms is available.
Subjects: Mathematics, Statistics, Medicine. UNIs: thm2105, dm2418. Departments: Statistics. Genre: Articles.

Algorithms for Sparse Linear Classifiers in the Massive Data Setting
https://academiccommons.columbia.edu/catalog/ac:173908
Balakrishnan, Suhrid; Madigan, David B.; Bartlett, Peter. DOI: 10.7916/D8Z0368X. Wed, 28 Jun 2017 20:24:23 +0000.
Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.
Subjects: Statistics, Artificial intelligence. UNIs: dm2418. Departments: Statistics. Genre: Articles.

Location Estimation in Wireless Networks: A Bayesian Approach
https://academiccommons.columbia.edu/catalog/ac:173820
Madigan, David B.; Ju, Wen-Hua; Krishnan, P.; Krishnakumar, A. S.; Zorych, Ivan. DOI: 10.7916/D82V2D74. Wed, 28 Jun 2017 20:23:52 +0000.
We present a Bayesian hierarchical model for indoor location estimation in wireless networks. We demonstrate that our model achieves accuracy similar to that of other published models and algorithms. By harnessing prior knowledge, our model drastically reduces the requirement for training data as compared with existing approaches.
Subjects: Mathematics, Statistics. UNIs: dm2418. Departments: Statistics. Genre: Articles.

A One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
https://academiccommons.columbia.edu/catalog/ac:173899
Balakrishnan, Suhrid; Madigan, David B. DOI: 10.7916/D8B56GTP. Wed, 28 Jun 2017 20:23:50 +0000.
For Bayesian analysis of massive data, Markov chain Monte Carlo (MCMC) techniques often prove infeasible due to computational resource constraints. Standard MCMC methods generally require a complete scan of the dataset for each iteration. Ridgeway and Madigan (2002) and Chopin (2002b) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a small portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this paper, we present "1PFS," an efficient, one-pass algorithm. The algorithm employs a simple modification of the Ridgeway and Madigan (2002) particle filtering algorithm that replaces the MCMC-based "rejuvenation" step with a more efficient "shrinkage" kernel-smoothing-based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway and Madigan (2002), namely a mixture model for Markov chains and Bayesian logistic regression. Our results indicate the proposed scheme delivers accurate parameter estimates while employing only a single pass through the data.
Subjects: Mathematics, Statistics. UNIs: dm2418. Departments: Statistics. Genre: Articles.

Analysis of Variance of Cross-Validation Estimators of the Generalization Error
https://academiccommons.columbia.edu/catalog/ac:173902
Markatou, Marianthi; Tian, Hong; Biswas, Shameek; Hripcsak, George M. DOI: 10.7916/D86D5R2X. Wed, 28 Jun 2017 20:23:49 +0000.
This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables Y = Card(S_j ∩ S_j′) and Y* = Card(S_j^c ∩ S_j′^c), where S_j, S_j′ are two training sets, and S_j^c, S_j′^c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results to the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.
Subjects: Statistics, Artificial intelligence. UNIs: mm168, ht2031, spb2003, gh13. Departments: Biostatistics, Biomedical Informatics, Statistics. Genre: Articles.

[Least Angle Regression]: Discussion
https://academiccommons.columbia.edu/catalog/ac:173841
Madigan, David B.; Ridgeway, Greg. DOI: 10.7916/D81V5C29. Wed, 28 Jun 2017 20:23:33 +0000.
Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms [e.g., Osborne, Presnell and Turlach (2000)]. Efron, Hastie, Johnstone and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such this paper is an important contribution to statistical computing.
Subjects: Mathematics, Statistics. UNIs: dm2418. Departments: Statistics. Genre: Articles.

[A Report on the Future of Statistics]: Comment
https://academiccommons.columbia.edu/catalog/ac:173850
Madigan, David B.; Stuetzle, Werner. DOI: 10.7916/D8D50K3V. Wed, 28 Jun 2017 20:23:32 +0000.
"Extraordinary opportunities for statistical ideas and for statisticians now present themselves. However, to take advantage of the opportunities, statistics has to change the way in which it recruits and trains students. Statistics has primarily focused on squeezing the maximum amount of information out of limited data. This paradigm is rapidly diminishing in importance and statistics education finds itself out of step with reality. The problems begin at the high school and undergraduate levels, where the standard course includes a narrow set of pre-computing-era topics. At the graduate level, the typical statistics program suffers from the same problem..." -- page 408
Subjects: Mathematics--Study and teaching, Education, Higher. UNIs: dm2418. Departments: Statistics. Genre: Articles.

Correction: Separation and completeness properties for AMP chain graph Markov models
https://academiccommons.columbia.edu/catalog/ac:173887
Madigan, David B.; Levitz, Michael; Perlman, Michael D. DOI: 10.7916/D8QF8R05. Wed, 28 Jun 2017 20:23:16 +0000.
Correction of Table 2 on page 1757 of 'Separation and completeness properties for AMP chain graph Markov models', Annals of Statistics, volume 29 (2001).
Subjects: Mathematics, Statistics. UNIs: dm2418. Departments: Statistics. Genre: Articles.

Book Reviews: Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth.
https://academiccommons.columbia.edu/catalog/ac:173915
Madigan, David B.10.7916/D8DZ06D8Wed, 28 Jun 2017 20:23:08 +0000"Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth. MIT Press, Cambridge, MA, 2001. $50.00. xxxii+546 pp., hardcover. ISBN 0-262-08290-X. Is data mining the same as statistics? The distinguished authors of Principles of Data Mining struggle to make a distinction between the two subjects. In the end, what they have written is a fine applied statistics text." -- page 501Statisticsdm2418StatisticsReviewsBayesian Model Averaging: a Tutorial (with Comments by M. Clyde, David Draper and E. I. George, and a Rejoinder by the Authors)
https://academiccommons.columbia.edu/catalog/ac:173853
Hoeting, Jennifer A.; Madigan, David B.; Raftery, Adrian E.; Volinsky, Chris T.; Clyde, M.; Draper, David; George, E. I.10.7916/D84M92N7Wed, 28 Jun 2017 20:23:00 +0000Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.Statisticsdm2418StatisticsArticlesSeparation and Completeness Properties for Amp Chain Graph Markov Models
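As a hedged illustration (not from the tutorial itself), one standard way to approximate BMA weights is via BIC: weight each candidate model by exp(-BIC/2) and normalize. Everything below — the data, the three-predictor model space, the weighting — is a toy construction of my own:

```python
import numpy as np
from itertools import combinations

# Toy data (made up): only the first of three candidate predictors matters
rng = np.random.default_rng(1)
n = 200
X = rng.standard_normal((n, 3))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(n)

def bic(cols):
    # OLS on an intercept plus the chosen columns; Gaussian BIC
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

# Enumerate all 8 subsets; exp(-BIC/2) approximates the posterior model weight
models = [c for r in range(4) for c in combinations(range(3), r)]
bics = np.array([bic(m) for m in models])
w = np.exp(-(bics - bics.min()) / 2)
w /= w.sum()                       # approximate posterior model probabilities
best = models[int(np.argmax(w))]
```

A BMA prediction would then average each model's fitted values with these weights rather than committing to `best` alone, which is the mechanism the abstract describes.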
https://academiccommons.columbia.edu/catalog/ac:173847
Levitz, Michael; Perlman, Michael D.; Madigan, David B.10.7916/D8X34VJGWed, 28 Jun 2017 20:23:00 +0000Pearl’s well-known d-separation criterion for an acyclic directed graph (ADG) is a pathwise separation criterion that can be used to efficiently identify all valid conditional independence relations in the Markov model determined by the graph. This paper introduces p-separation, a pathwise separation criterion that efficiently identifies all valid conditional independences under the Andersson–Madigan–Perlman (AMP) alternative Markov property for chain graphs (= adicyclic graphs), which include both ADGs and undirected graphs as special cases. The equivalence of p-separation to the augmentation criterion occurring in the AMP global Markov property is established, and p-separation is applied to prove completeness of the global Markov property for AMP chain graph models. Strong completeness of the AMP Markov property is established, that is, the existence of Markov perfect distributions that satisfy those and only those conditional independences implied by the AMP property (equivalently, by p-separation). A linear-time algorithm for determining p-separation is presented.Mathematics, Statisticsdm2418StatisticsArticlesA Characterization of Markov Equivalence Classes for Acyclic Digraphs
https://academiccommons.columbia.edu/catalog/ac:173896
Andersson, Steen A.; Madigan, David B.; Perlman, Michael D.10.7916/D8FX77J3Wed, 28 Jun 2017 20:22:39 +0000Undirected graphs and acyclic digraphs (ADG's), as well as their mutual extension to chain graphs, are widely used to describe dependencies among variables in multivariate distributions. In particular, the likelihood functions of ADG models admit convenient recursive factorizations that often allow explicit maximum likelihood estimates and that are well suited to building Bayesian networks for expert systems. Whereas the undirected graph associated with a dependence model is uniquely determined, there may be many ADG's that determine the same dependence (i.e., Markov) model. Thus, the family of all ADG's with a given set of vertices is naturally partitioned into Markov-equivalence classes, each class being associated with a unique statistical model. Statistical procedures, such as model selection or model averaging, that fail to take into account these equivalence classes may incur substantial computational or other inefficiencies. Here it is shown that each Markov-equivalence class is uniquely determined by a single chain graph, the essential graph, that is itself simultaneously Markov equivalent to all ADG's in the equivalence class. Essential graphs are characterized, a polynomial-time algorithm for their construction is given, and their applications to model selection and other statistical questions are described.Mathematics, Statisticsdm2418StatisticsArticlesA Note on Equivalence Classes of Directed Acyclic Independence Graphs
https://academiccommons.columbia.edu/catalog/ac:173826
Madigan, David B.10.7916/D8TB150CWed, 28 Jun 2017 20:22:29 +0000Directed acyclic independence graphs (DAIGs) play an important role in recent developments in probabilistic expert systems and influence diagrams (Chyu [1]). The purpose of this note is to show that DAIGs can usefully be grouped into equivalence classes where the members of a single class share identical Markov properties. These equivalence classes can be identified via a simple graphical criterion. This result is particularly relevant to model selection procedures for DAIGs (see, e.g., Cooper and Herskovits [2] and Madigan and Raftery [4]) because it reduces the problem of searching among possible orientations of a given graph to that of searching among the equivalence classes.Mathematics, Statisticsdm2418StatisticsArticles[Bayesian Analysis in Expert Systems]: Comment: What's Next?
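An aside not drawn from the note itself: the simple graphical criterion it alludes to is the Verma–Pearl characterization — two DAGs are Markov equivalent iff they have the same skeleton and the same v-structures (a → c ← b with a, b non-adjacent). A few lines of Python can check this on toy DAGs given as child-adjacency dicts:

```python
# Verma-Pearl criterion: same skeleton + same v-structures
def skeleton(dag):
    # Undirected edge set of a DAG given as {node: [children]}
    return {frozenset((u, v)) for u, vs in dag.items() for v in vs}

def v_structures(dag):
    edges = skeleton(dag)
    parents = {}
    for u, chs in dag.items():
        for c in chs:
            parents.setdefault(c, set()).add(u)
    # Colliders a -> c <- b whose parents a, b are not adjacent
    return {(a, c, b)
            for c, ps in parents.items()
            for a in ps for b in ps
            if a < b and frozenset((a, b)) not in edges}

def markov_equivalent(d1, d2):
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

# a -> b -> c and a <- b <- c encode the same independences; a -> b <- c does not
chain1 = {"a": ["b"], "b": ["c"], "c": []}
chain2 = {"a": [], "b": ["a"], "c": ["b"]}
collider = {"a": ["b"], "b": [], "c": ["b"]}
```

This is exactly the reduction the note exploits: searching over equivalence classes rather than over all orientations of a given skeleton.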
https://academiccommons.columbia.edu/catalog/ac:173856
Madigan, David B.10.7916/D8W37TFJWed, 28 Jun 2017 20:22:27 +0000"These papers represent two of the many different graphical modeling camps that have emerged from a flurry of activity in the past decade. The paper by Cox and Wermuth falls within the statistical graphical modeling camp and provides a useful generalization of that body of work. There is, of course, a price to be paid for this generality, namely that the interpretation of the graphs is more complex...The paper by Spiegelhalter, Dawid, Lauritzen and Cowell falls within the probabilistic expert system camp. This is a tour de force by researchers responsible for much of the astonishing progress in this area. Ten years ago, probabilistic models were shunned by the artificial intelligence community. That they are now widely accepted and used is due in large measure to the insights and efforts of these authors, along with other pioneers such as Judea Pearl and Peter Cheeseman..." -- page 261Mathematics, Statisticsdm2418StatisticsArticlesGenome-wide gene-based analysis of rheumatoid arthritis-associated interaction with PTPN22 and HLA-DRB1
https://academiccommons.columbia.edu/catalog/ac:184526
Lo, Shaw-Hwa; Chong, Lei; Xie, Jun; Huang, Chien Hsun; Qiao, Bo; Zheng, Tian10.7916/D8NP22VMWed, 28 Jun 2017 20:15:27 +0000The genes PTPN22 and HLA-DRB1 have been found by a number of studies to confer an increased risk for rheumatoid arthritis (RA), which indicates that both genes play an important role in RA etiology. It is believed that they not only have strong association with RA individually, but also interact with other related genes that have not been found to have predisposing RA mutations. In this paper, we conduct genome-wide searches for RA-associated gene-gene interactions that involve PTPN22 or HLA-DRB1 using the Genetic Analysis Workshop 16 Problem 1 data from the North American Rheumatoid Arthritis Consortium. MGC13017, HSPCAL3, MIA, PTPNS1L, and IGLVI-70, which showed association with RA in previous studies, have been confirmed in our analysis.Genetics, Biometryshl5, tz33StatisticsArticlesRheumatoid arthritis-associated gene-gene interaction network for rheumatoid arthritis candidate genes
https://academiccommons.columbia.edu/catalog/ac:184531
Zheng, Tian; Qiao, Bo; Lo, Shaw-Hwa; Xie, Jun; Huang, Chien-Hsun; Cong, Lei10.7916/D8HX1B3VWed, 28 Jun 2017 20:15:21 +0000Rheumatoid arthritis (RA, MIM 180300) is a chronic and complex autoimmune disease. Using the North American Rheumatoid Arthritis Consortium (NARAC) data set provided in Genetic Analysis Workshop 16 (GAW16), we used the genotype-trait distortion (GTD) scores and proposed analysis procedures to capture the gene-gene interaction effects of multiple susceptibility gene regions on RA. In this paper, we focused on 27 RA candidate gene regions (531 SNPs) based on a literature search. Statistical significance was evaluated using 1000 permutations. HLADRB1 was found to have strong marginal association with RA. We identified 14 significant interactions (p < 0.01), which were aggregated into an association network among 12 selected candidate genes PADI4, FCGR3, TNFRSF1B, ITGAV, BTLA, SLC22A4, IL3, VEGF, TNF, NFKBIL1, TRAF1-C5, and MIF. Based on our and other contributors' findings during the GAW16 conference, we further studied 24 candidate regions with 336 SNPs. We found 23 significant interactions (p-value < 0.01), nine interactions in addition to our initial findings, and the association network was extended to include candidate genes HLA-A, HLA-B, HLA-C, CTLA4, and IL6. As we will discuss in this paper, the reported possible interactions between genes may suggest potential biological activities of RA.Biometry, Geneticstz33, shl5, ch2526StatisticsArticlesR2WinBUGS: A Package for Running WinBUGS from R
https://academiccommons.columbia.edu/catalog/ac:154734
Sturtz, Sibylle; Ligges, Uwe; Gelman, Andrew E.10.7916/D80C55HHTue, 27 Jun 2017 15:43:29 +0000The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has finished, it is possible either to read the resulting data into R by the package itself—which gives a compact graphical summary of inference and convergence diagnostics—or to use the facilities of the coda package for further analyses of the output. Examples are given to demonstrate the usage of this package.Statisticsag389StatisticsArticlesMultiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box
https://academiccommons.columbia.edu/catalog/ac:154731
Su, Yu-Sung; Gelman, Andrew E.; Hill, Jennifer; Yajima, Masanao10.7916/D8VQ3CD3Tue, 27 Jun 2017 15:43:28 +0000Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.Statisticsag389StatisticsArticlesBayesian Statistical Pragmatism
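The mi package is R software; as a loose Python analogue (my assumption, not the authors' code), scikit-learn's `IterativeImputer` implements the same chained-equations idea of modeling each incomplete variable conditionally on the others. The data below is a synthetic two-variable example:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: x2 is correlated with x1, and 20% of x2 goes missing
rng = np.random.default_rng(2)
n = 500
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)
X = np.column_stack([x1, x2])

X_miss = X.copy()
mask = rng.random(n) < 0.2
X_miss[mask, 1] = np.nan

# Each variable with missing values is regressed on the others,
# cycling until the imputations stabilize (chained equations)
imp = IterativeImputer(random_state=0, max_iter=10)
X_filled = imp.fit_transform(X_miss)
```

The mi package's distinctive contribution is the diagnostics on top of this loop (residual plots, observed-vs-imputed comparisons), which this sketch does not reproduce.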
https://academiccommons.columbia.edu/catalog/ac:154737
Gelman, Andrew E.10.7916/D8MC98QJTue, 27 Jun 2017 15:39:20 +0000I agree with Rob Kass’ point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.Statisticsag389StatisticsArticlesSegregation in Social Networks Based on Acquaintanceship and Trust
https://academiccommons.columbia.edu/catalog/ac:154740
DiPrete, Thomas A.; Gelman, Andrew E.; McCormick, Tyler; Teitler, Julien O.; Zheng, Tian10.7916/D8F198DHTue, 27 Jun 2017 15:38:27 +0000Using 2006 General Social Survey data, the authors compare levels of segregation by race and along other dimensions of potential social cleavage in the contemporary United States. Americans are not as isolated as the most extreme recent estimates suggest. However, hopes that “bridging” social capital is more common in broader acquaintanceship networks than in core networks are not supported. Instead, the entire acquaintanceship network is perceived by Americans to be about as segregated as the much smaller network of close ties. People do not always know the religiosity, political ideology, family behaviors, or socioeconomic status of their acquaintances, but perceived social divisions on these dimensions are high, sometimes rivaling racial segregation in acquaintanceship networks. The major challenge to social integration today comes from the tendency of many Americans to isolate themselves from others who differ on race, political ideology, level of religiosity, and other salient aspects of social identity.Statisticstad61, ag389, thm2105, jot8, tz33StatisticsArticlesEnabling Reproducible Research: Open Licensing for Scientific Innovation
https://academiccommons.columbia.edu/catalog/ac:140147
Stodden, Victoria C.10.7916/D8N01H1ZMon, 26 Jun 2017 21:44:23 +0000There is a gap in the current licensing and copyright structure for the growing number of scientists releasing their research publicly, particularly on the Internet. Scientific research produces more scholarship than the final paper: for example, the code, data structures, experimental design and parameters, documentation, and figures, are all important both for communication of the scholarship and replication of the results. US copyright law is a barrier to the sharing of scientific scholarship since it establishes exclusive rights for creators over their work, thereby limiting the ability of others to copy, use, build upon, or alter the research. This is precisely opposite to prevailing scientific norms, which provide both that results be replicated before accepted as knowledge, and that scientific understanding be built upon previous discoveries for which authorship recognition is given. In accordance with these norms and to encourage the release of all scientific scholarship, I propose the Reproducible Research Standard (RRS) both to ensure attribution and facilitate the sharing of scientific works. Using the RRS on all components of scientific scholarship will encourage reproducible scientific investigation, facilitate greater collaboration, and promote engagement of the larger community in scientific learning and discovery.Communication of technical information, Intellectual propertyvcs2115StatisticsArticlesOpen science: policy implications for the evolving phenomenon of user-led scientific innovation
https://academiccommons.columbia.edu/catalog/ac:140127
Stodden, Victoria C.10.7916/D8183H2BMon, 26 Jun 2017 21:44:23 +0000From contributions of astronomy data and DNA sequences to disease treatment research, scientific activity by non-scientists is a real and emergent phenomenon, raising policy questions. This involvement in science can be understood as an issue of access to publications, code, and data that facilitates public engagement in the research process, so appropriate policy to support the associated welfare-enhancing benefits is essential. Current legal barriers to citizen participation can be alleviated by scientists' use of the "Reproducible Research Standard," thus making the literature, data, and code associated with scientific results accessible. The enterprise of science is undergoing deep and fundamental changes, particularly in how scientists obtain results and share their work: the promise of open research dissemination held by the Internet is gradually being fulfilled by scientists. Contributions to science from beyond the ivory tower are forcing a rethinking of traditional models of knowledge generation, evaluation, and communication. The notion of a scientific "peer" is blurred with the advent of lay contributions to science, raising questions regarding the concepts of peer-review and recognition. New collaborative models are emerging around both open scientific software and the generation of scientific discoveries that bear a similarity to open innovation models in other settings. Public engagement in science can be understood as an issue of access to knowledge for public involvement in the research process, facilitated by appropriate policy to support the welfare-enhancing benefits deriving from citizen-science.Communication of technical information, Information sciencevcs2115StatisticsArticlesReproducible Research in Computational Harmonic Analysis
https://academiccommons.columbia.edu/catalog/ac:140150
Stodden, Victoria C.10.7916/D8RR27RZMon, 26 Jun 2017 21:44:23 +0000Scientific computation is emerging as absolutely central to the scientific method. Unfortunately, it's error-prone and currently immature—traditional scientific publication is incapable of finding and rooting out errors in scientific computation—which must be recognized as a crisis. An important recent development and a necessary response to the crisis is reproducible computational research in which researchers publish the article along with the full computational environment that produces the results. The authors have practiced reproducible computational research for 15 years and have integrated it with their scientific research and with doctoral and postdoctoral education. In this article, they review their approach and how it has evolved over time, discussing the arguments for and against working reproducibly.Communication of technical information, Information sciencevcs2115StatisticsArticlesReproducible Research: Addressing the Need for Data and Code Sharing in Computational Science
https://academiccommons.columbia.edu/catalog/ac:140124
Stodden, Victoria C.10.7916/D8WH30H4Mon, 26 Jun 2017 21:44:23 +0000Roundtable participants identified ways of making computational research details readily available, which is a crucial step in addressing the current credibility crisis.Communication of technical information, Information sciencevcs2115StatisticsArticlesThe Legal Framework for Reproducible Scientific Research: Licensing and Copyright
https://academiccommons.columbia.edu/catalog/ac:140153
Stodden, Victoria C.10.7916/D8H70RBPMon, 26 Jun 2017 21:44:23 +0000As computational researchers increasingly make their results available in a reproducible way, and often outside the traditional journal publishing mechanism, questions naturally arise with regard to copyright, subsequent use and citation, and ownership rights in general. The growing number of scientists who release their research publicly face a gap in the current licensing and copyright structure, particularly on the Internet. Scientific research produces more than the final paper: The code, data structures, experimental design and parameters, documentation, and figures are all important for scholarship communication and result replication. The author proposes the reproducible research standard for scientific researchers to use for all components of their scholarship that should encourage reproducible scientific investigation through attribution, facilitate greater collaboration, and promote engagement of the larger community in scientific learning and discovery.Communication of technical information, Intellectual propertyvcs2115StatisticsArticlesTrust Your Science? Open Your Data and Code
https://academiccommons.columbia.edu/catalog/ac:139369
Stodden, Victoria C.10.7916/D8CJ8Q0PMon, 26 Jun 2017 21:44:23 +0000This is a view by Victoria Stodden on the reproducibility of the computational sciences. It discusses the Reproducibility, Replicability, and Repeatability of code produced across the sciences. Stodden also discusses the rising prominence of the computational sciences in the digital age and what that means for the future of science and data collection.Information sciencevcs2115StatisticsArticlesMultiscale Representations for Manifold-Valued Data
https://academiccommons.columbia.edu/catalog/ac:140178
Rahman, Inam Ur; Drori, Iddo; Stodden, Victoria C.; Donoho, David L.; Schroeder, Peter10.7916/D87371F4Mon, 26 Jun 2017 21:43:59 +0000We describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as: the sphere S^2, the special orthogonal group SO(3), the positive definite matrices SPD(n), and the Grassmann manifolds G(n, k). The representations are based on the deployment of Deslauriers-Dubuc and Average Interpolating pyramids "in the tangent plane" of such manifolds, using the Exp and Log maps of those manifolds. The representations provide "wavelet coefficients" which can be thresholded, quantized, and scaled much as traditional wavelet coefficients. Tasks such as compression, noise removal, contrast enhancement, and stochastic simulation are facilitated by this representation. The approach applies to general manifolds, but is particularly suited to the manifolds we consider, i.e., Riemannian symmetric spaces, such as S^(n-1), SO(n), G(n, k), where the Exp and Log maps are effectively computable. Applications to manifold-valued data sources of a geometric nature (motion, orientation, diffusion) seem particularly immediate. A software toolbox, SymmLab, can reproduce the results discussed in this paper.Statisticsvcs2115StatisticsArticlesVirtual Northern Analysis of the Human Genome
https://academiccommons.columbia.edu/catalog/ac:140156
Hurowitz, Evan H.; Drori, Iddo; Stodden, Victoria C.; Brown, Patrick O.; Donoho, David L.10.7916/D8DR350JMon, 26 Jun 2017 21:41:45 +0000We applied the Virtual Northern technique to human brain mRNA to systematically measure human mRNA transcript lengths on a genome-wide scale. We used separation by gel electrophoresis followed by hybridization to cDNA microarrays to measure 8,774 mRNA transcript lengths representing at least 6,238 genes at high (>90%) confidence. By comparing these transcript lengths to the Refseq and H-Invitational full-length cDNA databases, we found that nearly half of our measurements appeared to represent novel transcript variants. Comparison of length measurements determined by hybridization to different cDNAs derived from the same gene identified clones that potentially correspond to alternative transcript variants. We observed a close linear relationship between ORF and mRNA lengths in human mRNAs, identical in form to the relationship we had previously identified in yeast. Some functional classes of protein are encoded by mRNAs whose untranslated regions (UTRs) tend to be longer or shorter than average; these functional classes were similar in both human and yeast. Human transcript diversity is extensive and largely unannotated. Our length dataset can be used as a new criterion for judging the completeness of cDNAs and annotating mRNA sequences. Similar relationships between the lengths of the UTRs in human and yeast mRNAs and the functions of the proteins they encode suggest that UTR sequences serve an important regulatory role among eukaryotes.Genetics, Molecular biologyvcs2115StatisticsArticlesA Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization
https://academiccommons.columbia.edu/catalog/ac:173817
Eyheramendy, Susana; Madigan, David B.10.7916/D86M34ZFMon, 26 Jun 2017 20:40:43 +0000We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method to text classification and to simulated data and show that our method outperforms the logistic and probit models and also the elastic net, in general by a substantial margin.Mathematics, Statisticsdm2418StatisticsChapters (layout features)A Global Empirical Evaluation of New Communication Technology Use and Democratic Tendency
https://academiccommons.columbia.edu/catalog/ac:140144
Stodden, Victoria C.10.7916/D8XW4V5GMon, 26 Jun 2017 20:26:57 +0000Is the dramatic increase in Internet use associated with a commensurate rise in democracy? Few previous studies have drawn on multiple perception-based measures of governance to assess the Internet's effects on the process of democratization. This paper uses perception-based time series data on "Voice & Accountability," "Political Stability," and "Rule of Law" to provide insights into democratic tendency. The results of regression analysis suggest that the level of "Voice & Accountability" in a country increases with Internet use, while the level of "Political Stability" decreases with increasing Internet use. Additionally, Internet use was found to increase significantly for countries with increasing levels of "Voice & Accountability." In contrast, "Rule of Law" was not significantly affected by a country's level of Internet use. Increasing cell phone use did not seem to affect either "Voice & Accountability," "Political Stability," or "Rule of Law." In turn, cell phone use was not affected by any of these three measures of democratic tendency. When limiting our analysis to autocratic regimes, we noted a significant negative effect of Internet and cell phone use on "Political Stability" and found that the "Rule of Law" and "Political Stability" metrics drove ICT adoption.Internet, Political sciencevcs2115StatisticsArticlesInnovation and Growth through Open Access to Scientific Research: Three Ideas for High-Impact Rule Changes
https://academiccommons.columbia.edu/catalog/ac:139585
Stodden, Victoria C.10.7916/D82N5BNRMon, 26 Jun 2017 20:26:57 +0000A paper on Data Policies by Victoria Stodden where she explores the framing principles that should be applied to the reproduction of computational research and results and how those principles should be used to guide scientific policy during the digital age.Communication of technical information, Intellectual propertyvcs2115StatisticsArticlesWhen Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
https://academiccommons.columbia.edu/catalog/ac:140175
Donoho, David L.; Stodden, Victoria C.10.7916/D88D05N7Mon, 26 Jun 2017 20:25:26 +0000We interpret non-negative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of "parts" and NMF correctly identifies the "parts". We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases.Statisticsvcs2115StatisticsArticlesFast ℓ1 Minimization for Genomewide Analysis of mRNA Lengths
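As a hedged illustration of the separated-support condition (the tiny dataset here is my own toy construction, not one of the paper's articulation databases), scikit-learn's NMF recovers the generating "parts" when each basis vector is supported on disjoint coordinates:

```python
import numpy as np
from sklearn.decomposition import NMF

# Two "parts" with separated support; every sample is a non-negative mix
rng = np.random.default_rng(3)
parts = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
weights = rng.random((100, 2))
V = weights @ parts

model = NMF(n_components=2, init="nndsvd", max_iter=1000, random_state=0)
W = model.fit_transform(V)
H = model.components_   # rows should match the parts up to scale and order
```

Under the paper's conditions the simplicial cone is unique, so the factorization is essentially identified; without separated support, NMF can return a different, equally valid cone.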
https://academiccommons.columbia.edu/catalog/ac:140172
Drori, Iddo; Stodden, Victoria C.; Hurowitz, Evan H.10.7916/D80V8P4RMon, 26 Jun 2017 20:25:25 +0000Application of the virtual northern method to human mRNA allows us to systematically measure transcript length on a genome-wide scale [1]. Characterization of RNA transcripts by length provides a measurement which complements cDNA sequencing. We have robustly extracted the lengths of the transcripts expressed by each gene for comparison with the Unigene, Refseq, and H-Invitational databases [2, 3]. Obtaining an accurate probability for each peak requires performing multiple bootstrap simulations, each involving a deconvolution operation which is equivalent to finding the sparsest non-negative solution of an underdetermined system of equations. This process is computationally intensive for a large number of simulations and genes. In this contribution we present an efficient approximation method which is faster than general purpose solvers by two orders of magnitude, and in practice reduces our processing time from a week to hours.Genetics, Statisticsvcs2115StatisticsArticlesBreakdown Point of Model Selection When the Number of Variables Exceeds the Number of Observations
https://academiccommons.columbia.edu/catalog/ac:140168
Donoho, David L.; Stodden, Victoria C.10.7916/D84M9DXZMon, 26 Jun 2017 20:25:24 +0000The classical multivariate linear regression problem assumes p variables X1, X2, ... , Xp and a response vector y, each with n observations, and a linear relationship between the two: y = X beta + z, where z ~ N(0, sigma^2). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p >> n. We find that 1) the breakdown point is well-defined for random X-models and low noise, 2) increasing noise shifts the breakdown point to lower levels of sparsity, and reduces the model recovery ability of the algorithm in a systematic way, and 3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.Statisticsvcs2115StatisticsArticlesScientists, Share Secrets or Lose Funding
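A sketch in that spirit (a synthetic setup of my own, not the paper's experiments): with p >> n and noiseless responses, a greedy selector recovers a sparse support exactly, but recovery collapses once the sparsity level passes a critical fraction of n. Orthogonal matching pursuit stands in here for forward stepwise selection:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# Toy breakdown experiment: p >> n, noiseless linear model
rng = np.random.default_rng(4)
n, p = 100, 400

def recovery_rate(k, trials=20):
    # Fraction of trials in which OMP recovers a k-sparse support exactly
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        support = rng.choice(p, size=k, replace=False)
        beta = np.zeros(p)
        beta[support] = 2.0
        y = X @ beta
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(X, y)
        hits += set(np.flatnonzero(omp.coef_)) == set(support)
    return hits / trials

# Far below breakdown recovery succeeds; well past it, it does not
low_k, high_k = recovery_rate(3), recovery_rate(30)
```

Sweeping k between these extremes traces out the kind of sharp transition the paper's n/p phase diagrams describe.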
https://academiccommons.columbia.edu/catalog/ac:147742
Stodden, Victoria C.10.7916/D85B0BP4Mon, 26 Jun 2017 14:29:32 +0000More and more published scientific studies are difficult or impossible to repeat. Federal agencies that fund scientific research are in a position to help fix this problem. They should require that all scientists whose studies they finance share the files that generated their published findings, the raw data and the computer instructions that carried out their analysis.Communication of technical information, Intellectual propertyvcs2115StatisticsArticlesSource codes for GLMLE algorithm
https://academiccommons.columbia.edu/catalog/ac:178966
He, Ran; Zheng, Tian10.7916/D8HH6HQRWed, 21 Jun 2017 13:55:17 +0000These are the R source codes for the algorithm proposed for fitting exponential random graph models (ERGMs) on large social networks in our paper "Estimation of exponential random graph models for large social networks via graph limits". Specifically, the ERGM we implement is the one that considers homomorphism densities of edges, two-stars, and triangles, the one we examine in the above paper.Statistics, Computer sciencerh2528, tz33StatisticsSoftwareSPAr package for Fan and Lo (2013) "A robust model-free approach for rare variants association studies incorporating gene-gene and gene-environmental interactions."
https://academiccommons.columbia.edu/catalog/ac:179424
Fan, Ruixue; Lo, Shaw-Hwa10.7916/D84Q7SN6Wed, 21 Jun 2017 13:55:16 +0000Recently more and more evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Secondly, it does not sacrifice the marginal detection power; in the situation when rare variants only have marginal effects it is comparable with the most competitive method in the current literature. Thirdly, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can readily tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions.
This package is also maintained on the Comprehensive R Archive Network (http://cran.r-project.org). It contains the R programs, a user's manual, and example code.Genetics, Statisticsrf2283, shl5StatisticsSoftware