Academic Commons Search Results
https://academiccommons.columbia.edu/catalog?action=index&controller=catalog&f%5Bdepartment_facet%5D%5B%5D=Statistics&format=rss&fq%5B%5D=has_model_ssim%3A%22info%3Afedora%2Fldpd%3AContentAggregator%22&q=&rows=500&sort=record_creation_date+desc

Rich state, poor state, red state, blue state: What's the matter with Connecticut?
https://academiccommons.columbia.edu/catalog/ac:125297
Gelman, Andrew E.; Shor, Boris; Bafumi, Joseph; Park, David K.
10.7916/D8WD45S4
Thu, 13 Apr 2017 15:46:17 +0000
For decades, the Democrats have been viewed as the party of the poor, with the Republicans representing the rich. Recent presidential elections, however, have shown a reverse pattern, with Democrats performing well in the richer blue states in the northeast and coasts, and Republicans dominating in the red states in the middle of the country and the south. Through multilevel modeling of individual-level survey data and county- and state-level demographic and electoral data, we reconcile these patterns. Furthermore, we find that income matters more in red America than in blue America. In poor states, rich people are much more likely than poor people to vote for the Republican presidential candidate, but in rich states (such as Connecticut), income has a very low correlation with vote preference.
Mathematical statistics
ag389
Statistics
Articles

Partisans without constraint: Political polarization and trends in American public opinion
https://academiccommons.columbia.edu/catalog/ac:125291
Baldassarri, Delia; Gelman, Andrew E.
10.7916/D84T6QK4
Thu, 13 Apr 2017 15:46:16 +0000
Public opinion polarization is here conceived as a process of alignment along multiple lines of potential disagreement and measured as growing constraint in individuals' preferences. Using NES data from 1972 to 2004, the authors model trends in issue partisanship--the correlation of issue attitudes with party identification--and issue alignment--the correlation between pairs of issues--and find a substantive increase in issue partisanship, but little evidence of issue alignment. The findings suggest that opinion changes correspond more to a resorting of party labels among voters than to greater constraint on issue attitudes: since parties are more polarized, they are now better at sorting individuals along ideological lines. Levels of constraint vary across population subgroups: strong partisans and wealthier and politically sophisticated voters have grown more coherent in their beliefs. The authors discuss the consequences of partisan realignment and group sorting on the political process and potential deviations from the classic pluralistic account of American politics.
Mathematical statistics
ag389
Statistics
Articles

Struggles with survey weighting and regression modeling
https://academiccommons.columbia.edu/catalog/ac:125309
Gelman, Andrew E.
10.7916/D8H41XN4
Thu, 13 Apr 2017 15:46:16 +0000
The general principles of Bayesian data analysis imply that models for survey responses should be constructed conditional on all variables that affect the probability of inclusion and nonresponse, which are also the variables used in survey weighting and clustering. However, such models can quickly become very complicated, with potentially thousands of poststratification cells. It is then a challenge to develop general families of multilevel probability models that yield reasonable Bayesian inferences. We discuss in the context of several ongoing public health and social surveys. This work is currently open-ended, and we conclude with thoughts on how research could proceed to solve these problems.
Mathematical statistics
ag389
Statistics
Articles

Rejoinder: Struggles with survey weighting and regression modeling
https://academiccommons.columbia.edu/catalog/ac:125312
Gelman, Andrew E.
10.7916/D8CC15WB
Thu, 13 Apr 2017 15:46:16 +0000
I was motivated to write this paper, with its controversial opening line, "Survey weighting is a mess," from various experiences as an applied statistician.
Mathematical statistics
ag389
Statistics
Articles

Bayes, Jeffreys, Prior Distributions and the Philosophy of Statistics
https://academiccommons.columbia.edu/catalog/ac:125279
Gelman, Andrew E.
10.7916/D8J38ZTD
Thu, 13 Apr 2017 15:46:16 +0000
I actually own a copy of Harold Jeffreys's Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin and Rousseau as a platform for further discussion of foundational issues. In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys's principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys's preference for simplicity; and (3) a key generalization of Jeffreys's ideas is to explicitly include model checking in the process of data analysis.
Mathematical statistics
ag389
Statistics
Articles

The playing field shifts: Predicting the seats-votes curve in the 2008 U.S. House election
https://academiccommons.columbia.edu/catalog/ac:125285
Kastellec, Jonathan P.; Gelman, Andrew E.; Chandler, Jamie P.
10.7916/D8DB873G
Thu, 13 Apr 2017 15:46:16 +0000
The 2008 U.S. House elections mark the first time since 1994 that the Democrats will seek to retain a majority. With the political climate favoring Democrats this year, it seems almost certain that the party will retain control, and will likely increase its share of seats. In five national polls taken in June of this year, Democrats enjoyed on average a 13-point advantage in the generic congressional ballot; as Bafumi, Erikson, and Wlezien (2007) point out, these early polls, suitably adjusted, are good predictors of the November vote. As of late July, bettors at intrade.com put the probability of the Democrats retaining a majority at about 95% (Intrade.com 2008). Elsewhere in this symposium, Klarner (2008) predicts an 11-seat gain for the Democrats, while Lockerbie (2008) forecasts a 25-seat pickup. In this paper we document how the electoral playing field has shifted from a Republican advantage between 1996 and 2004 to a Democratic tilt today. In an earlier article (Kastellec, Gelman, and Chandler 2008), we predicted the seats-votes curve in the 2006 election, showing how the Democrats faced an uphill battle in their effort to take control of the House and, their victory notwithstanding, ended up winning a lower percentage of seats than their average district vote nationwide. We follow up on this analysis by using the same method to predict the seats-votes curve in 2008. Due to the shift in incumbency advantage from the Republicans to the Democrats, compounded by a greater number of retirements among Republican members, we show that the Democrats now enjoy a partisan bias, and can expect to win more seats than votes for the first time since 1992. While this bias is not as large as the advantage the Republicans held in 2006, it will likely help the Democrats increase their share of seats.
Mathematical statistics
jpk2004, ag389
Statistics
Articles

Discussion of the Article "Website Morphing"
https://academiccommons.columbia.edu/catalog/ac:125288
Gelman, Andrew E.
10.7916/D88K7G9V
Thu, 13 Apr 2017 15:46:16 +0000
The article under discussion illustrates the trade-off between optimization and exploration that is fundamental to statistical experimental design. In this discussion, I suggest that the research under discussion could be made even more effective by checking the fit of the model by comparing observed data to replicated data sets simulated from the fitted model.
Mathematical statistics
ag389
Statistics
Articles

Predicting and dissecting the seats-votes curve in the 2006 U.S. House election
https://academiccommons.columbia.edu/catalog/ac:125294
Kastellec, Jonathan P.; Gelman, Andrew E.; Chandler, Jamie P.
10.7916/D8125ZW5
Thu, 13 Apr 2017 15:46:14 +0000
The 2008 U.S. House elections mark the first time since 1994 that the Democrats will seek to retain a majority. With the political climate favoring Democrats this year, it seems almost certain that the party will retain control, and will likely increase its share of seats. In five national polls taken in June of this year, Democrats enjoyed on average a 13-point advantage in the generic congressional ballot; as Bafumi, Erikson, and Wlezien (2007) point out, these early polls, suitably adjusted, are good predictors of the November vote. As of late July, bettors at intrade.com put the probability of the Democrats retaining a majority at about 95% (Intrade.com 2008). Elsewhere in this symposium, Klarner (2008) predicts an 11-seat gain for the Democrats, while Lockerbie (2008) forecasts a 25-seat pickup. In this paper we document how the electoral playing field has shifted from a Republican advantage between 1996 and 2004 to a Democratic tilt today. In an earlier article (Kastellec, Gelman, and Chandler 2008), we predicted the seats-votes curve in the 2006 election, showing how the Democrats faced an uphill battle in their effort to take control of the House and, their victory notwithstanding, ended up winning a lower percentage of seats than their average district vote nationwide. We follow up on this analysis by using the same method to predict the seats-votes curve in 2008. Due to the shift in incumbency advantage from the Republicans to the Democrats, compounded by a greater number of retirements among Republican members, we show that the Democrats now enjoy a partisan bias, and can expect to win more seats than votes for the first time since 1992. While this bias is not as large as the advantage the Republicans held in 2006, it will likely help the Democrats increase their share of seats.
Mathematical statistics
jpk2004, ag389
Statistics
Articles

Comment: Bayesian Checking of the Second Levels of Hierarchical Models
https://academiccommons.columbia.edu/catalog/ac:125303
Gelman, Andrew E.
10.7916/D8RN3F38
Thu, 13 Apr 2017 15:46:13 +0000
Bayarri and Castellanos (BC) have written an interesting paper discussing two forms of posterior model check, one based on cross-validation and one based on replication of new groups in a hierarchical model. We think both these checks are good ideas and can become even more effective when understood in the context of posterior predictive checking. For the purpose of discussion, however, it is most interesting to focus on the areas where we disagree with BC.
Mathematical statistics
ag389
Statistics
Articles

Bayes: Radical, liberal, or conservative?
https://academiccommons.columbia.edu/catalog/ac:125306
Gelman, Andrew E.
10.7916/D8MW2PCJ
Thu, 13 Apr 2017 15:46:12 +0000
Mathematical statistics
ag389
Statistics
Articles

Random Walk Models, Preferential Attachment, and Sequential Monte Carlo Methods for Analysis of Network Data
https://academiccommons.columbia.edu/catalog/ac:209294
Bloem-Reddy, Benjamin Michael
http://dx.doi.org/10.7916/D8348R5Q
Wed, 22 Mar 2017 18:09:32 +0000
Networks arise in nearly every branch of science, from biology and physics to sociology and economics. A signature of many network datasets is strong local dependence, which gives rise to phenomena such as sparsity, power law degree distributions, clustering, and structural heterogeneity. Statistical models of networks require a careful balance of flexibility to faithfully capture that dependence, and simplicity, to make analysis and inference tractable. In this dissertation, we introduce a class of models that insert one network edge at a time via a random walk, permitting the location of new edges to depend explicitly on the structure of the existing network, while remaining probabilistically and computationally tractable. Connections to graph kernels are made through the probability generating function of the random walk length distribution. The limiting degree distribution is shown to exhibit power law behavior, and the properties of the limiting degree sequence are studied analytically with martingale methods. In the second part of the dissertation, we develop a class of particle Markov chain Monte Carlo algorithms to perform inference for a large class of sequential random graph models, even when the observation consists only of a single graph. Using these methods, we derive a particle Gibbs sampler for random walk models. Fit to synthetic data, the sampler accurately recovers the model parameters; fit to real data, the model offers insight into the typical length scale of dependence in the network, and provides a new measure of vertex centrality.
The arrival times of new vertices are the key to obtaining results for both theory and inference. In the third part, we undertake a careful study of the relationship between the arrival times, sparsity, and heavy tailed degree distributions in preferential attachment-type models of partitions and graphs. A number of constructive representations of the limiting degrees are obtained, and connections are made to exchangeable Gibbs partitions as well as to recent results on the limiting degrees of preferential attachment graphs.
Statistics, Monte Carlo method, Computer networks, Markov processes, Information networks--Statistical methods
bmr2136
Statistics
Dissertations

Estimation of Total Body Skeletal Muscle Mass in Chinese Adults: Prediction Model by Dual-Energy X-Ray Absorptiometry
https://academiccommons.columbia.edu/catalog/ac:207364
Zhao, Xinyu; Wang, ZiMian; Zhang, Junyi; Hua, Jianming; He, Wei; Zhu, Shankuan
http://dx.doi.org/10.7916/D8MS3ZGJ
Mon, 27 Feb 2017 14:42:47 +0000
Background: There are few reports on total body skeletal muscle mass (SM) in Chinese. The objective of this study is to establish a prediction model of SM for Chinese adults.
Methodology: Appendicular lean soft tissue (ALST) was measured by dual energy X-ray absorptiometry (DXA) and SM by magnetic resonance image (MRI) in 66 Chinese adults (52 men and 14 women). Images of MRI were segmented into compartments including intermuscular adipose tissue (IMAT) and IMAT-free SM. Regression was used to fit the prediction model SM = c + k × ALST. Age and gender were adjusted in the fitted model. The piece-wise linear function was performed to further explore the effect of age on SM. ‘Leave-One-Out Cross Validation’ was utilized to evaluate the prediction performance. The significance of observed differences between predicted and actual SM was tested by t test and the level of agreement was assessed by the method of Bland and Altman.
Results: Men had greater ALST and IMAT-free SM than women. ALST was the primary predictor and highly correlated with IMAT-free SM (R² = 0.94, SEE = 1.11 kg, P < 0.001). Age was an additional predictor (SM prediction model with age adjusted R² = 0.95, SEE = 1.05 kg, P < 0.001). There was a piece-wise linear relationship between age and IMAT-free SM: IMAT-free SM = 1.21 × ALST − 0.98 (Age < 45 years) and IMAT-free SM = 1.21 × ALST − 0.98 − 0.04 × (Age − 45) (Age ≥ 45 years). The prediction performance of this age-adjusted model was good as assessed by ‘Leave-One-Out Cross Validation’. No significant difference between measured and predicted IMAT-free SM was detected.
Conclusion: A previous SM prediction model developed in multi-ethnic groups underestimated SM by 2.3% and 3.4% for Chinese men and women. A new prediction model by DXA has been established to predict SM in Chinese adults.
Biology, Musculoskeletal system, Muscles, Chinese--Health and hygiene, Human anatomy
College of Physicians and Surgeons, Statistics
Articles

Flexible Sparse Learning of Feature Subspaces
https://academiccommons.columbia.edu/catalog/ac:207319
Ma, Yuting
http://dx.doi.org/10.7916/D83X8CBB
Thu, 23 Feb 2017 18:09:44 +0000
It is widely observed that the performances of many traditional statistical learning methods degenerate when confronted with high-dimensional data. One promising approach to prevent this downfall is to identify the intrinsic low-dimensional spaces where the true signals embed and to pursue the learning process on these informative feature subspaces. This thesis focuses on the development of flexible sparse learning methods of feature subspaces for classification. Motivated by the success of some existing methods, we aim at learning informative feature subspaces for high-dimensional data of complex nature with better flexibility, sparsity and scalability.
The first part of this thesis is inspired by the success of distance metric learning in casting flexible feature transformations by utilizing local information. We propose a nonlinear sparse metric learning algorithm, named the sDist algorithm, that uses a boosting-based nonparametric solution to address the metric learning problem for high-dimensional data. Leveraging a rank-one decomposition of the symmetric positive semi-definite weight matrix of the Mahalanobis distance metric, we restructure a hard global optimization problem into a forward stage-wise learning of weak learners through a gradient boosting algorithm. In each step, the algorithm progressively learns a sparse rank-one update of the weight matrix by imposing an L-1 regularization. Nonlinear feature mappings are adaptively learned by a hierarchical expansion of interactions integrated within the boosting framework. Meanwhile, an early stopping rule is imposed to control the overall complexity of the learned metric. As a result, without relying on computationally intensive tools, our approach automatically guarantees three desirable properties of the final metric: positive semi-definiteness, low rank and element-wise sparsity. Numerical experiments show that our learning model compares favorably with the state-of-the-art methods in the current literature of metric learning.
The second problem arises from the observation of high instability and feature selection bias when applying online methods to highly sparse data of large dimensionality for sparse learning problems. Due to the heterogeneity in feature sparsity, existing truncation-based methods incur slow convergence and high variance. To mitigate this problem, we introduce a stabilized truncated stochastic gradient descent algorithm. We employ a soft-thresholding scheme on the weight vector where the imposed shrinkage is adaptive to the amount of information available in each feature. The variability in the resulting sparse weight vector is further controlled by stability selection integrated with the informative truncation. To facilitate better convergence, we adopt an annealing strategy on the truncation rate. We show that, when the true parameter space is of low dimension, the stabilization with annealing strategy helps to achieve a lower regret bound in expectation.
Statistics, Mathematical statistics, Machine learning--Statistical methods, Machine learning
ym2396
Statistics
Dissertations

Machine learning and data mining in complex genomic data: a review on the lessons learned in Genetic Analysis Workshop Nineteen
https://academiccommons.columbia.edu/catalog/ac:206639
Konig, Inke R.; Auerbach, Jonathan Lyle; Gola, Damian; Held, Elizabeth; Holzinger, Emily R.; Legault, Marc Andre; Sun, Rui; Tintle, Nathan; Yang, Hsin Chou
http://dx.doi.org/10.7916/D8HT2TZ6
Mon, 30 Jan 2017 15:47:55 +0000
In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated from two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data.
In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.
Genomics, Machine learning, Data mining
jla2167
Statistics
Articles

Advances in Credit Risk Modeling
https://academiccommons.columbia.edu/catalog/ac:206336
Neuberg, Richard
http://dx.doi.org/10.7916/D84T6JZ0
Fri, 20 Jan 2017 18:09:03 +0000
Following the recent financial crisis, financial regulators have placed a strong emphasis on reducing expectations of government support for banks, and on better managing and assessing risks in the banking system. This thesis considers three current topics in credit risk and the statistical problems that arise there.
The first of these topics is expectations of government support in distressed banks. We utilize unique features of the European credit default swap market to find that market expectations of European government support for distressed banks have decreased -- an important development in the credibility of financial reforms.
The second topic we treat is the estimation of covariance matrices from the perspective of market risk management. This problem arises, for example, in the central clearing of credit default swaps. We propose several specialized loss functions, and a simple but effective visualization tool to assess estimators. We find that proper regularization significantly improves the performance of dynamic covariance models in estimating portfolio variance.
The third topic we consider is estimation risk in the pricing of financial products. When parameters are not known with certainty, a better informed counterparty may strategically pick mispriced products. We discuss how total estimation risk can be minimized approximately. We show how a premium for remaining estimation risk may be determined when one counterparty is better informed than the other, but a market collapse is to be avoided, using a simple example from loan pricing. We illustrate the approach with credit bureau data.
Statistics, Finance, Credit--Management--Statistical methods, Financial risk, Financial risk management, Finance--Statistical methods, Finance--Statistics
rn2325
Statistics, Business
Dissertations

Tensor Analysis Reveals Distinct Population Structure that Parallels the Different Computational Roles of Areas M1 and V1
https://academiccommons.columbia.edu/catalog/ac:205830
Seely, Jeffrey Scott; Kaufman, Matthew T.; Ryu, Stephen I.; Shenoy, Krishna V.; Cunningham, John Patrick; Churchland, Mark M.
http://dx.doi.org/10.7916/D8N29XF1
Tue, 13 Dec 2016 12:56:44 +0000
Cortical firing rates frequently display elaborate and heterogeneous temporal structure. One often wishes to compute quantitative summaries of such structure (a basic example is the frequency spectrum) and compare with model-based predictions. The advent of large-scale population recordings affords the opportunity to do so in new ways, with the hope of distinguishing between potential explanations for why responses vary with time. We introduce a method that assesses a basic but previously unexplored form of population-level structure: when data contain responses across multiple neurons, conditions, and times, they are naturally expressed as a third-order tensor. We examined tensor structure for multiple datasets from primary visual cortex (V1) and primary motor cortex (M1). All V1 datasets were ‘simplest’ (there were relatively few degrees of freedom) along the neuron mode, while all M1 datasets were simplest along the condition mode. These differences could not be inferred from surface-level response features. Formal considerations suggest why tensor structure might differ across modes. For idealized linear models, structure is simplest across the neuron mode when responses reflect external variables, and simplest across the condition mode when responses reflect population dynamics. This same pattern was present for existing models that seek to explain motor cortex responses. Critically, only dynamical models displayed tensor structure that agreed with the empirical M1 data. These results illustrate that tensor structure is a basic feature of the data. For M1 the tensor structure was compatible with only a subset of existing models.
Neurosciences, Biostatistics, Visual cortex, Motor cortex, Calculus of tensors, Neurons
jss2219, jpc2181, mc3502
Neurobiology and Behavior, Statistics, Neuroscience
Articles

A Robust Model-free Approach for Rare Variants Association Studies Incorporating Gene-Gene and Gene-Environmental Interactions
https://academiccommons.columbia.edu/catalog/ac:203800
Fan, Ruixue; Lo, Shaw-Hwa
http://dx.doi.org/10.7916/D84J0FF0
Tue, 01 Nov 2016 13:45:36 +0000
Recently, more and more evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Secondly, it does not sacrifice the marginal detection power; in the situation when rare variants only have marginal effects it is comparable with the most competitive method in current literature. Thirdly, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can readily tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions.
Genetics, Allelomorphism, Genotype-environment interaction
shl5
Statistics
Articles

Measuring Spatial Extremal Dependence
https://academiccommons.columbia.edu/catalog/ac:202722
Cho, Yong Bum
http://dx.doi.org/10.7916/D8PR7W8T
Tue, 11 Oct 2016 18:05:41 +0000
The focus of this thesis is extremal dependence among spatial observations. In particular, this research extends the notion of the extremogram to the spatial process setting. Proposed by Davis and Mikosch (2009), the extremogram measures extremal dependence for a stationary time series. The versatility and flexibility of the concept made it well suited for many time series applications, including those from finance and environmental science.
After defining the spatial extremogram, we investigate the asymptotic properties of the empirical estimator of the spatial extremogram. To this end, two sampling scenarios are considered: 1) observations are taken on the lattice and 2) observations are taken on a continuous region in a continuous space, in which the locations are points of a homogeneous Poisson point process. For both cases, we establish the central limit theorem for the empirical spatial extremogram under general mixing and dependence conditions. A high level overview is as follows. When observations are observed on a lattice, the asymptotic results generalize those obtained in Davis and Mikosch (2009). For non-lattice cases, we define a kernel estimator of the empirical spatial extremogram and establish the central limit theorem provided the bandwidth of the kernel gets smaller and the sampling region grows at proper speeds. We illustrate the performance of the empirical spatial extremogram using simulation examples, and then demonstrate the practical use of our results with a data set of rainfall in Florida and ground-level ozone data in the eastern United States.
The second part of the thesis is devoted to bootstrapping and variance estimation with a view towards constructing asymptotically correct confidence intervals. Even though the empirical spatial extremogram is asymptotically normal, the limiting variance is intractable. We consider three approaches: for lattice data, we use the circular bootstrap adapted to spatial observations, jackknife variance estimation, and subsampling variance estimation. For data sampled according to a Poisson process, we use subsampling methods to estimate the variance of the empirical spatial extremogram. We establish the (conditional) asymptotic normality for the circular block bootstrap estimator for the spatial extremogram and show L2 consistency of the variance estimated by jackknife and subsampling. Then, we propose a portmanteau style test to check the existence of extremal dependences at multiple lags. The validity of confidence intervals produced from these approaches and of the portmanteau style test is demonstrated through simulation examples. Finally, we apply this methodology to two data sets. The first is the amount of rainfall over a grid of locations in northern Florida. The second is ground-level ozone in the eastern United States, which is recorded on an irregularly spaced set of stations.
Statistics, Extremal problems (Mathematics), Spatial analysis (Statistics), Bootstrap (Statistics)
yc2500
Statistics
Dissertations

Enlargement of Filtration and the Strict Local Martingale Property in Stochastic Differential Equations
https://academiccommons.columbia.edu/catalog/ac:201869
Dandapani, Aditi
http://dx.doi.org/10.7916/D8XW4JZ2
Tue, 02 Aug 2016 12:25:23 +0000
In this thesis, we study the strict local martingale property of solutions of various types of stochastic differential equations and the effect of an initial expansion of the filtration on this property. For the models we consider, we either use existing criteria or, in the case where the stochastic differential equation has jumps, develop new criteria that can detect the presence of the strict local martingale property. We develop deterministic sufficient conditions on the drift and diffusion coefficient of the stochastic process such that an enlargement by initial expansion of the filtration can produce a strict local martingale from a true martingale. We also develop a way of characterizing the martingale property in stochastic volatility models where the local martingale has a general diffusion coefficient.
Mathematics, Applied mathematics, Martingales (Mathematics), Stochastic differential equations
ad2259
Applied Physics and Applied Mathematics, Statistics
Dissertations

On Model-Selection and Applications of Multilevel Models in Survey and Causal Inference
https://academiccommons.columbia.edu/catalog/ac:200369
Wang, Wei
http://dx.doi.org/10.7916/D8571C4Q
Wed, 22 Jun 2016 12:34:52 +0000
This thesis includes three parts. The overarching theme is how to analyze multilevel structured datasets, particularly in the areas of survey and causal inference. The first part discusses model selection of hierarchical models, in the context of a national political survey. I found that the commonly used model selection criteria based on predictive accuracy, such as cross validation, don't perform very well in the case of political surveys, and I explore the possible causes. The second part centers around a unique data set on the presidential election collected through an online platform. I show that with adequate modeling, meaningful and highly accurate information could be extracted from this highly-biased data set. The third part builds on a formal causal inference framework for group-structured data, such as meta-analysis and multi-site trials. In particular, I develop a Gaussian Process model under this framework and demonstrate additional insights that can be gained compared with traditional parametric models.
Statistics, Social sciences--Statistical methods--Data processing
ww2243
Statistics
Dissertations

Pulmonary Hyperinflation and Left Ventricular Mass
https://academiccommons.columbia.edu/catalog/ac:199889
Smith, Benjamin; Kawut, Steven M.; Bluemke, David A.; Basner, Robert C.; Gomes, Antoinette S.; Hoffman, Eric; Kalhan, Ravi; Lima, Joao A. C.; Liu, Chia-Ying; Michos, Erin D.; Prince, Martin R.; Rabbani, Leroy E.; Rabinowitz, Daniel; Shimbo, Daichi; Shea, Steven J. C.; Barr, R. Grahamhttp://dx.doi.org/10.7916/D8BR8S99Fri, 10 Jun 2016 19:04:20 +0000Background—Left ventricular (LV) mass is an important predictor of heart failure and cardiovascular mortality, yet determinants of LV mass are incompletely understood. Pulmonary hyperinflation in chronic obstructive pulmonary disease (COPD) may contribute to changes in intrathoracic pressure that increase LV wall stress. We therefore hypothesized that residual lung volume in COPD would be associated with greater LV mass.
Methods and Results—The Multi-Ethnic Study of Atherosclerosis (MESA) COPD Study recruited smokers 50 to 79 years of age who were free of clinical cardiovascular disease. LV mass was measured by cardiac magnetic resonance. Pulmonary function testing was performed according to guidelines. Regression models were used to adjust for age, sex, body size, blood pressure, and other cardiac risk factors. Among 119 MESA COPD Study participants, the mean age was 69±6 years, 55% were male, and 65% had COPD, mostly of mild or moderate severity. Mean LV mass was 128±34 g. Residual lung volume was independently associated with greater LV mass (7.2 g per 1-SD increase in residual volume; 95% confidence interval, 2.2–12; P=0.004) and was similar in magnitude to that of systolic blood pressure (7.6 g per 1-SD increase in systolic blood pressure; 95% confidence interval, 4.3–11; P<0.001). Similar results were observed for the ratio of LV mass to end-diastolic volume (P=0.02) and with hyperinflation measured as residual volume to total lung capacity ratio (P=0.009).
Conclusions—Pulmonary hyperinflation, as measured by residual lung volume or residual lung volume to total lung capacity ratio, is associated with greater LV mass.Health sciences, Epidemiology, Medicine, Heart--Left ventricle, Heart failure, Lungs--Diseases, Obstructivebs2723, rcb42, mrp2102, ler8, dr105, ds2231, ss35, rgb9Medicine, Radiology, Statistics, Center for Behavioral Cardiovascular HealthArticlesAsymptotic Theory and Applications of Random Functions
https://academiccommons.columbia.edu/catalog/ac:198322
Li, Xiaoouhttp://dx.doi.org/10.7916/D8QF8SW7Tue, 03 May 2016 09:21:26 +0000Random functions are a central component in many statistical and probabilistic problems. This dissertation presents theoretical analysis and computation for random functions and their applications in statistics. This dissertation consists of two parts. The first part is on the topic of classic continuous random fields. We present asymptotic analysis and computation for three non-linear functionals of random fields. In Chapter 1, we propose an efficient Monte Carlo algorithm for computing P{sup_T f(t)>b} when b is large, and f is a Gaussian random field living on a compact subset T. For each pre-specified relative error ɛ, the proposed algorithm runs in a constant time for an arbitrarily large b and computes the probability with the relative error ɛ. In Chapter 2, we present the asymptotic analysis for the tail probability of ∫_T e^{σf(t)+μ(t)}dt under the asymptotic regime that σ tends to zero. In Chapter 3, we consider partial differential equations (PDE) with random coefficients, and we develop an unbiased Monte Carlo estimator with finite variance for computing expectations of the solution to random PDEs. Moreover, the expected computational cost of generating one such estimator is finite. In this analysis, we employ a quadratic approximation to solve random PDEs and perform precise error analysis of this numerical solver. The second part of this dissertation focuses on topics in statistics. The random functions of interest are likelihood functions, whose maximum plays a key role in statistical inference. We present asymptotic analysis for likelihood based hypothesis tests and sequential analysis. In Chapter 4, we derive an analytical form for the exponential decay rate of error probabilities of the generalized likelihood ratio test for testing two general families of hypotheses.
In Chapter 5, we study asymptotic properties of the generalized sequential probability ratio test, the stopping rule of which is the first boundary crossing time of the generalized likelihood ratio statistic. We show that this sequential test is asymptotically optimal in the sense that it achieves asymptotically the shortest expected sample size as the maximal type I and type II error probabilities tend to zero. These results have important theoretical implications in hypothesis testing, model selection, and other areas where maximum likelihood is employed.Statistics, Mathematical statistics, Monte Carlo method, Differential equations, Partial, Differential equations, Partial--Asymptotic theoryxl2306StatisticsDissertationsSpectral Filtering for Spatio-temporal Dynamics and Multivariate Forecasts
https://academiccommons.columbia.edu/catalog/ac:198310
Meng, Luhttp://dx.doi.org/10.7916/D80Z7385Tue, 03 May 2016 09:20:21 +0000Due to the increasing availability of massive spatio-temporal data sets, modeling high dimensional data becomes quite challenging. A large number of research questions are rooted in identifying underlying dynamics in such spatio-temporal data. For many applications, the science suggests that the intrinsic dynamics be smooth and of low dimension. To reduce the variance of estimates and increase the computational tractability, dimension reduction is also quite necessary in the modeling procedure. In this dissertation, we propose a spectral filtering approach for dimension reduction and forecast amelioration, and apply it to multiple applications. We show the effectiveness of dimension reduction via our method and also illustrate its power for prediction in both simulation and real data examples. The resultant lower dimensional principal component series has a diagonal spectral density at each frequency whose diagonal elements are in descending order, which is not well motivated and can be hard to interpret. Therefore we propose a phase-based filtering method to create principal component series with interpretable dynamics in the time domain. Our method is based on an approach of structural decomposition and phase-aligned construction in the frequency domain, identifying lower-rank dynamics and its components embedded in a high dimensional spatio-temporal system. In both our simulated examples and real data applications, we illustrate that the proposed method is able to separate and identify meaningful lower-rank movements. Benefiting from the zero-coherence property of the principal component series, we subsequently develop a predictive model for high-dimensional forecasting via lower-rank dynamics.
Our modeling approach reduces the multivariate modeling task to multiple univariate modeling tasks and is flexible in combining with regularization techniques to obtain more stable estimates and improve interpretability. The simulation results and real data analysis show that our model achieves superior forecast performance compared to the class of autoregressive models.Statistics, Statistics, Mathematical statistics--Data processing, Dynamics, Dimension reduction (Statistics)lm2844StatisticsDissertationsLatent Variable Modeling and Statistical Learning
https://academiccommons.columbia.edu/catalog/ac:198122
Chen, Yunxiaohttp://dx.doi.org/10.7916/D8PV6KBNFri, 29 Apr 2016 21:15:12 +0000Latent variable models play an important role in psychological and educational measurement, which attempt to uncover the underlying structure of responses to test items. This thesis focuses on the development of statistical learning methods based on latent variable models, with applications to psychological and educational assessments. In that connection, the following problems are considered.
The first problem arises from a key assumption in latent variable modeling, namely the local independence assumption, which states that given an individual's latent variable (vector), his/her responses to items are independent. This assumption is likely violated in practice, as many other factors, such as the item wording and question order, may exert additional influence on the item responses. Any exploratory analysis that relies on this assumption may result in choosing too many nuisance latent factors that can neither be stably estimated nor reasonably interpreted. To address this issue, a family of models is proposed that relax the local independence assumption by combining the latent factor modeling and graphical modeling. Under this framework, the latent variables capture the across-the-board dependence among the item responses, while a second graphical structure characterizes the local dependence. In addition, the number of latent factors and the sparse graphical structure are both unknown and learned from data, based on a statistically solid and computationally efficient method.
The second problem is to learn the relationship between items and latent variables, a structure that is central to multidimensional measurement. In psychological and educational assessments, this relationship is typically specified by experts when items are written and is incorporated into the model without further verification after data collection. Such a non-empirical approach may lead to model misspecification and substantial lack of model fit, resulting in erroneous interpretation of assessment results. Motivated by this, I consider learning the item-latent variable relationship from data. It is formulated as a latent variable selection problem, for which theoretical analysis and a computationally efficient algorithm are provided.Statistics, Latent variables, Educational tests and measurements--Statistical methods, Psychological tests--Statistical methods, Learning, Psychology of--Mathematical modelsyc2710StatisticsDissertationsAdvances in Model Selection Techniques with Applications to Statistical Network Analysis and Recommender Systems
https://academiccommons.columbia.edu/catalog/ac:198116
Franco Saldana, Diegohttp://dx.doi.org/10.7916/D8GB2424Fri, 29 Apr 2016 21:14:50 +0000This dissertation focuses on developing novel model selection techniques, the process by which a statistician selects one of a number of competing models of varying dimensions, under an array of different statistical assumptions on observed data. Traditionally, two main reasons have been advocated by researchers for performing model selection strategies over classical maximum likelihood estimates (MLEs). The first reason is prediction accuracy, where by shrinking or setting to zero some model parameters, one sacrifices the unbiasedness of MLEs for a reduced variance, which in turn leads to an overall improvement in predictive performance. The second reason relates to interpretability of the selected models in the presence of a large number of predictors, where in order to obtain a parsimonious representation exhibiting the relationship between the response and covariates, we are willing to sacrifice some of the smaller details brought in by spurious predictors.
In the first part of this work, we revisit the family of variable selection techniques known as sure independence screening procedures for generalized linear models and the Cox proportional hazards model. After clever combination of some of its most powerful variants, we propose new extensions based on the idea of sample splitting, data-driven thresholding, and combinations thereof. A publicly available package developed in the R statistical software demonstrates considerable improvements in terms of model selection and competitive computational time between our enhanced variable selection procedures and traditional penalized likelihood methods applied directly to the full set of covariates.
Next, we develop model selection techniques within the framework of statistical network analysis for two frequent problems arising in the context of stochastic blockmodels: community number selection and change-point detection. In the second part of this work, we propose a composite likelihood based approach for selecting the number of communities in stochastic blockmodels and its variants, with robustness consideration against possible misspecifications in the underlying conditional independence assumptions of the stochastic blockmodel. Several simulation studies, as well as two real data examples, demonstrate the superiority of our composite likelihood approach when compared to the traditional Bayesian Information Criterion or variational Bayes solutions. In the third part of this thesis, we extend our analysis on static network data to the case of dynamic stochastic blockmodels, where our model selection task is the segmentation of a time-varying network into temporal and spatial components by means of a change-point detection hypothesis testing problem. We propose a corresponding test statistic based on the idea of data aggregation across the different temporal layers through kernel-weighted adjacency matrices computed before and after each candidate change-point, and illustrate our approach on synthetic data and the Enron email corpus.
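For readers unfamiliar with the stochastic blockmodel mentioned above, the following is a minimal sketch of how such a network can be sampled. It is not the dissertation's method: the function name `sample_sbm`, the block sizes, and the edge probabilities are all hypothetical values chosen for illustration.

```python
import random


def sample_sbm(block_sizes, edge_probs, rng):
    """Sample an undirected stochastic blockmodel graph.

    block_sizes[k] is the number of nodes in community k, and
    edge_probs[a][b] is the probability of an edge between a node in
    community a and a node in community b (all values are illustrative).
    """
    # Assign each node the label of its community.
    labels = [k for k, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    # Draw each edge independently given the community labels.
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < edge_probs[labels[i]][labels[j]]:
                adj[i][j] = adj[j][i] = 1
    return labels, adj


rng = random.Random(1)
labels, adj = sample_sbm([5, 5], [[0.8, 0.1], [0.1, 0.8]], rng)
```

Selecting the number of communities then amounts to choosing the length of `block_sizes` from an observed `adj` alone, which is the model selection problem the abstract refers to.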
The matrix completion problem consists in the recovery of a low-rank data matrix based on a small sampling of its entries. In the final part of this dissertation, we extend prior work on nuclear norm regularization methods for matrix completion by incorporating a continuum of penalty functions between the convex nuclear norm and nonconvex rank functions. We propose an algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm-starts, and present a systematic study of the resulting spectral thresholding operators. We demonstrate that our proposed nonconvex regularization framework leads to improved model selection properties in terms of finding low-rank solutions with better predictive performance on a wide range of synthetic data and the famous Netflix data recommender system.Statistics, Statistics, Linear models (Statistics), Probabilities, Proportional hazards modelsdf2406StatisticsDissertationsAcupuncture point injection treatment of primary dysmenorrhoea: a randomised, double blind, controlled study
https://academiccommons.columbia.edu/catalog/ac:196923
Wade, Christine; Wang, L.; Zhao, W. J.; Cardini, F.; Kronenberg, Fredi; Gui, S. Q.; Ying, Zhu; Zhao, N. Q.; Chao, M. T.; Yu, J.http://dx.doi.org/10.7916/D80R9P9BThu, 31 Mar 2016 12:06:15 +0000Objective: To determine if injection of vitamin K3 in an acupuncture point is optimal for the treatment of primary dysmenorrhoea, when compared with 2 other injection treatments.
Setting: A Menstrual Disorder Centre at a public hospital in Shanghai, China. Participants: Chinese women aged 14–25 years with severe primary dysmenorrhoea for at least 6 months not relieved by any other treatment were recruited. Exclusion criteria were the use of oral contraceptives, intrauterine devices or anticoagulant drugs, pregnancy, history of abdominal surgery, participation in other therapies for pain and diagnosis of secondary dysmenorrhoea. Eighty patients with primary dysmenorrhoea, as defined on a 4-grade scale, completed the study. Two patients withdrew after randomisation. Interventions: A double-blind, double-dummy, randomised controlled trial compared vitamin K3 acupuncture point injection to saline acupuncture point injection and vitamin K3 deep muscle injection. Patients in each group received 3 injections at a single treatment visit.
Primary and secondary outcome measures: The primary outcome was the difference in subjective perception of pain as measured by an 11 unit Numeric Rating Scale (NRS). Secondary measurements were Cox Pain Intensity and Duration scales and the consumption of analgesic tablets before and after treatment and during 6 following cycles.
Results: Patients in all 3 groups experienced pain relief from the injection treatments. Differences in NRS measured mean pain scores between the 2 active control groups were less than 1 unit (−0.71, CI −1.37 to −0.05) and not significant, but the differences in average scores between the treatment hypothesised to be optimal and both active control groups (1.11, CI 0.45 to 1.78) and (1.82, CI 1.45 to 2.49) were statistically significant in adjusted mixed-effects models. Menstrual distress and use of analgesics were diminished for 6 months post-treatment. Conclusions: Acupuncture point injection of vitamin K3 relieves menstrual pain rapidly and is a useful treatment in an urban outpatient clinic.Pathology, Alternative medicine, Obstetrics and gynecology, Public health, Vitamin K--Therapeutic use, Acupuncture points, Dysmenorrhea, Clinical trialscmw2, fk11, yz2896Epidemiology, College of Physicians and Surgeons, StatisticsArticlesNew perspectives on learning, inference, and control in brains and machines
https://academiccommons.columbia.edu/catalog/ac:196425
Merel, Joshua Scotthttp://dx.doi.org/10.7916/D8C8296CWed, 16 Mar 2016 18:35:32 +0000The work presented in this thesis provides new perspectives and approaches for problems that arise in the analysis of neural data. Particular emphasis is placed on parameter fitting and automated analysis problems that would arise naturally in closed-loop experiments. Part one focuses on two brain-computer interface problems. First, we provide a framework for understanding co-adaptation, the setting in which decoder updating and user learning occur simultaneously. We also provide a new perspective on intention-based parameter fitting and tools to extend this approach to higher dimensional decoders. Part two focuses on event inference, which refers to the decomposition of observed timeseries data into interpretable events. We present application of event inference methods on voltage-clamp recordings as well as calcium imaging, and describe extensions to allow for combining data across modalities or trials.Neurosciences, Statistics, Neural circuitry, Machine learning, Human-machine systems, Neural networks (Computer science)--Statistical methods, Brain-computer interfacesjsm2183Neurobiology and Behavior, StatisticsDissertationsMethods for Personalized and Evidence Based Medicine
https://academiccommons.columbia.edu/catalog/ac:195007
Shahn, Zachhttp://dx.doi.org/10.7916/D8M0458SWed, 24 Feb 2016 21:14:26 +0000There is broad agreement that medicine ought to be 'evidence based' and 'personalized' and that data should play a large role in achieving both these goals. But the path from data to improved medical decision making is not clear. This thesis presents three methods that hopefully help in small ways to clear the path.
Personalized medicine depends almost entirely on understanding variation in treatment effect. Chapter 1 describes latent class mixture models for treatment effect heterogeneity that distinguish between continuous and discrete heterogeneity, use hierarchical shrinkage priors to mitigate overfitting and multiple comparisons concerns, and employ flexible error distributions to improve robustness. We apply different versions of these models to reanalyze a clinical trial comparing HIV treatments and a natural experiment on the effect of Medicaid on emergency department utilization.
Medical decisions often depend on observational studies performed on large longitudinal health insurance claims databases. These studies usually claim to identify a causal effect, but empirical evaluations have demonstrated that standard methods for causal discovery perform poorly in this context, most likely in large part due to the presence of unobserved confounding. Chapter 2 proposes an algorithm called Ensembles of Granger Graphs (EGG) that does not rely on the assumption that unobserved confounding is absent. In a simulation and experiments on a real claims database, EGG is robust to confounding, has high positive predictive value, and has high power to detect strong causal effects.
While decision making inherently involves causal inference, purely predictive models aid many medical decisions in practice. Predictions from health histories are challenging because the space of possible predictors is so vast. Not only are there thousands of health events to consider, but also their temporal interactions. In Chapter 3, we adapt a method originally developed for speech recognition that greedily constructs informative labeled graphs representing temporal relations between multiple health events at the nodes of randomized decision trees. We use this method to predict strokes in patients with atrial fibrillation using data from a Medicaid claims database.
I hope the ideas illustrated in these three projects inspire work that someday genuinely improves healthcare. I also include a short 'bonus' chapter on an improved estimate of effective sample size in importance sampling. This chapter is not directly related to medicine, but finds a home in this thesis nonetheless.Statistics, Medical care--Statistics, Evidence-based medicine, Personalized medicinezss2101StatisticsDissertationsA Generalizable Brain-Computer Interface (BCI) Using Machine Learning for Feature Discovery
https://academiccommons.columbia.edu/catalog/ac:192916
Nurse, Ewan S.; Karoly, Philippa J.; Grayden, David B.; Freestone, Dean R.http://dx.doi.org/10.7916/D8KS6R9NTue, 12 Jan 2016 16:08:36 +0000This work describes a generalized method for classifying motor-related neural signals for a brain-computer interface (BCI), based on a stochastic machine learning method. The method differs from the various feature extraction and selection techniques employed in many other BCI systems. The classifier does not use extensive a-priori information, resulting in reduced reliance on highly specific domain knowledge. Instead of pre-defining features, the time-domain signal is input to a population of multi-layer perceptrons (MLPs) in order to perform a stochastic search for the best structure. The results showed that the average performance of the new algorithm outperformed other published methods using the Berlin BCI IV (2008) competition dataset and was comparable to the best results in the Berlin BCI II (2002–3) competition dataset. The new method was also applied to electroencephalography (EEG) data recorded from five subjects undertaking a hand squeeze task and demonstrated high levels of accuracy with a mean classification accuracy of 78.9% after five-fold cross-validation. Our new approach has been shown to give accurate results across different motor tasks and signal types as well as between subjects.Neurosciences, Neural networks (Computer science), Brain-computer interfaces, Electroencephalography--Computer programs, Machine learning, Neurons, Neural networks (Computer science)StatisticsArticlesHuman and Machine Learning in Non-Markovian Decision Making
https://academiccommons.columbia.edu/catalog/ac:192871
Clarke, Aaron Michael; Friedrich, Johannes; Tartaglia, Elisa M.; Herzog, Michael H.; Marchesotti, Silvia; Senn, Walterhttp://dx.doi.org/10.7916/D8G44Q1DMon, 11 Jan 2016 15:02:09 +0000Humans can learn under a wide variety of feedback conditions. Reinforcement learning (RL), where a series of rewarded decisions must be made, is a particularly important type of learning. Computational and behavioral studies of RL have focused mainly on Markovian decision processes, where the next state depends on only the current state and action. Little is known about non-Markovian decision making, where the next state depends on more than the current state and action. Learning is non-Markovian, for example, when there is no unique mapping between actions and feedback. We have produced a model based on spiking neurons that can handle these non-Markovian conditions by performing policy gradient descent. Here, we examine the model’s performance and compare it with human learning and a Bayes optimal reference, which provides an upper bound on performance. We find that in all cases, our model of a population of spiking neurons describes human performance well.Education, Behavioral sciences, Markov processes, Learning strategies, Reinforcement learning, Neurons, Decision making, Decision making--Mathematical models, Markov processes--Mathematical modelsjf2954StatisticsArticlesDistributed Bayesian Computation and Self-Organized Learning in Sheets of Spiking Neurons with Local Lateral Inhibition
https://academiccommons.columbia.edu/catalog/ac:192253
Bill, Johannes; Buesing, Lars; Habenschuss, Stefan; Nessler, Bernhard; Maass, Wolfgang; Legenstein, Roberthttp://dx.doi.org/10.7916/D8862G4XMon, 14 Dec 2015 10:04:07 +0000During the last decade, Bayesian probability theory has emerged as a framework in cognitive science and neuroscience for describing perception, reasoning and learning of mammals. However, our understanding of how probabilistic computations could be organized in the brain, and how the observed connectivity structure of cortical microcircuits supports these calculations, is rudimentary at best. In this study, we investigate statistical inference and self-organized learning in a spatially extended spiking network model, that accommodates both local competitive and large-scale associative aspects of neural information processing, under a unified Bayesian account. Specifically, we show how the spiking dynamics of a recurrent network with lateral excitation and local inhibition in response to distributed spiking input, can be understood as sampling from a variational posterior distribution of a well-defined implicit probabilistic model. This interpretation further permits a rigorous analytical treatment of experience-dependent plasticity on the network level. Using machine learning theory, we derive update rules for neuron and synapse parameters which equate with Hebbian synaptic and homeostatic intrinsic plasticity rules in a neural implementation. In computer simulations, we demonstrate that the interplay of these plasticity rules leads to the emergence of probabilistic local experts that form distributed assemblies of similarly tuned cells communicating through lateral excitatory connections. The resulting sparse distributed spike code of a well-adapted network carries compressed information on salient input features combined with prior experience on correlations among them. 
Our theory predicts that the emergence of such efficient representations benefits from network architectures in which the range of local inhibition matches the spatial extent of pyramidal cells that share common afferent input.Neurosciences, Molecular biology, Statistics, Bayesian statistical decision theory, Neurons, Neuroplasticity, InhibitionStatisticsArticlesAn Assortment of Unsupervised and Supervised Applications to Large Data
https://academiccommons.columbia.edu/catalog/ac:189937
Agne, Michael Roberthttp://dx.doi.org/10.7916/D828073NThu, 15 Oct 2015 18:08:52 +0000This dissertation presents several methods that can be applied to large datasets with an enormous number of covariates. It is divided into two parts. In the first part of the dissertation, a novel approach to pinpointing sets of related variables is introduced. In the second part, several new methods and modifications of current methods designed to improve prediction are outlined. These methods can be considered extensions of the very successful I Score suggested by Lo and Zheng in a 2002 paper and refined in many papers since.
In Part I, unsupervised data (with no response) is addressed. In chapter 2, the novel unsupervised I score and its associated procedure are introduced and some of its unique theoretical properties are explored. In chapter 3, several simulations consisting of generally hard-to-wrangle scenarios demonstrate promising behavior of the approach. In chapter 4, the method is applied to the complex field of market basket analysis, with a specific grocery data set used to show it in action. It is compared to a natural competitor, the Apriori algorithm. The main contribution of this part of the dissertation is the unsupervised I score, but we also suggest several ways to leverage the variable sets the I score locates in order to mine for association rules.
In Part II, supervised data is confronted. Though the I Score has been used in reference to these types of data in the past, several interesting ways of leveraging it (and the modules of covariates it identifies) are investigated. Though much of this methodology adopts procedures which are individually well-established in literature, the contribution of this dissertation is organization and implementation of these methods in the context of the I Score. Several module-based regression and voting methods are introduced in chapter 7, including a new LASSO-based method for optimizing voting weights. These methods can be considered intuitive and readily applicable to a huge number of datasets of sometimes colossal size. In particular, in chapter 8, a large dataset on Hepatitis and another on Oral Cancer are analyzed. The results for some of the methods are quite promising and competitive with existing methods, especially with regard to prediction. A flexible and multifaceted procedure is suggested in order to provide a thorough arsenal when dealing with the problem of prediction in these complex data sets.
Ultimately, we highlight some benefits and future directions of the method.Statistics, Biostatisticsmra2110StatisticsDissertationsEfficiency in Lung Transplant Allocation Strategies
https://academiccommons.columbia.edu/catalog/ac:187899
Zou, Jingjinghttp://dx.doi.org/10.7916/D8QV3KKZTue, 12 May 2015 18:28:18 +0000Currently in the United States, lungs are allocated to transplant candidates based on the Lung Allocation Score (LAS). The LAS is an empirically derived score aimed at increasing total life span pre- and post-transplantation, for patients on lung transplant waiting lists. The goal here is to develop efficient allocation strategies in the context of lung transplantation.
In this study, patient and organ arrivals to the waiting list are modeled as independent homogeneous Poisson processes. Patients' health status prior to allocations are modeled as evolving according to independent and identically distributed finite-state inhomogeneous Markov processes, in which death is treated as an absorbing state. The expected post-transplantation residual life is modeled as depending on time on the waiting list and on current health status. For allocation strategies satisfying certain minimal fairness requirements, the long-term limit of expected average total life exists, and is used as the standard for comparing allocation strategies.
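As a rough illustration of the arrival model described above, the sketch below simulates two independent homogeneous Poisson processes by summing exponential inter-arrival times. The helper `poisson_arrival_times`, the rates, and the time horizon are hypothetical and are not taken from the thesis; the Markov health-status component is not modeled here.

```python
import random


def poisson_arrival_times(rate, horizon, rng):
    """Return arrival times of a homogeneous Poisson process on [0, horizon].

    Inter-arrival times of a rate-`rate` Poisson process are independent
    exponential random variables with mean 1 / rate.
    """
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(0)
# Illustrative rates only: patients arrive twice as often as organs.
patient_arrivals = poisson_arrival_times(rate=2.0, horizon=100.0, rng=rng)
organ_arrivals = poisson_arrival_times(rate=1.0, horizon=100.0, rng=rng)

# Empirical ratio of organ arrivals to patient arrivals over the horizon.
organ_to_patient_ratio = len(organ_arrivals) / len(patient_arrivals)
```

Under these assumed rates, the empirical ratio should settle near 0.5 as the horizon grows; this organ-to-patient arrival ratio is the quantity on which the abstract's upper bounds depend.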
Via the Hamilton-Jacobi-Bellman equations, upper bounds as a function of the ratio of organ arrival rate to the patient arrival rate for the long-term expected average total life are derived, and corresponding to each upper bound is an allocable set of (state, time) pairs at which patients would be optimally transplanted. As availability of organs increases, the allocable set expands monotonically, and ranking members of the waiting list according to the availability at which they enter the allocable set provides an allocation strategy that leads to long-term expected average total life close to the upper bound.
Simulation studies are conducted with model parameters estimated from national lung transplantation data from United Network for Organ Sharing (UNOS). Results suggest that compared to the LAS, the proposed allocation strategy could provide a 7% increase in average total life.Statisticsjz2335StatisticsDissertationsA Graphon-based Framework for Modeling Large Networks
https://academiccommons.columbia.edu/catalog/ac:200607
He, Ranhttp://dx.doi.org/10.7916/D8MC8Z3CMon, 11 May 2015 15:34:34 +0000This thesis focuses on a new graphon-based approach for fitting models to large networks and establishes a general framework for incorporating nodal attributes into modeling. The scale of network data nowadays renders classical network modeling and inference inappropriate. Novel modeling strategies and estimation methods are required.
Depending on whether the model structure is specified a priori or solely determined from data, existing models for networks can be classified as parametric and non-parametric. Compared to the former, a non-parametric model often allows for an easier and more straightforward estimation procedure of the network structure. On the other hand, the connectivities and dynamics of networks fitted by non-parametric models can be quite difficult to interpret, as compared to parametric models.
In this thesis, we first propose a computational estimation procedure for a class of parametric models that are among the most widely used models for networks, built upon tools from non-parametric models with practical innovations that make it efficient and capable of scaling to large networks.
Extensions of this base method are then considered in two directions. Inspired by a popular network sampling method, we further propose an estimation algorithm using sampled data, in order to circumvent the practical obstacle that data on the entire network are hard to obtain and analyze. The base algorithm is also generalized to the case of complex network structure where nodal attributes are involved. Two general frameworks for a non-parametric model are proposed in order to incorporate nodal impact, one with a hierarchical structure and the other employing similarity measures.
Several simulation studies are carried out to illustrate the improved performance of our proposed methods over existing algorithms. The proposed methods are also applied to several real data sets, including Slashdot online social networks and in-school friendship networks from the National Longitudinal Study of Adolescent to Adult Health (AddHealth Study). An array of graphical visualizations and quantitative diagnostic tools, which are specifically designed for the evaluation of goodness of fit for network models, are developed and illustrated with these data sets. Some observations of using these tools via our algorithms are also examined and discussed.Statistics, Network analysis (Planning)--Mathematical models, Statistics, AlgorithmsStatisticsDissertationsGLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks
https://academiccommons.columbia.edu/catalog/ac:185410
He, Ran; Zheng, Tianhttp://dx.doi.org/10.7916/D8S46QVQThu, 02 Apr 2015 14:49:11 +0000Large networks, as a form of big data, have received an increasing amount of attention in data science, especially large social networks, which are reaching sizes of hundreds of millions, with daily interactions on the scale of billions. Analyzing and modeling these data to understand the connectivities and dynamics of large networks is therefore important in a wide range of scientific fields. Among popular models, exponential random graph models (ERGMs) have been developed to study these complex networks by directly modeling network structures and features. ERGMs, however, are hard to scale to large networks because maximum likelihood estimation of parameters in these models can be very difficult, due to the unknown normalizing constant. Alternative strategies based on Markov chain Monte Carlo (MCMC) draw samples to approximate the likelihood, which is then maximized to obtain the maximum likelihood estimators (MLE). These strategies have poor convergence due to model degeneracy issues and cannot be used on large networks. Chatterjee et al. (Ann Stat 41:2428–2461, 2013) propose a new theoretical framework for estimating the parameters of ERGMs by approximating the normalizing constant using emerging tools in graph theory, namely graph limits. In this paper, we construct a complete computational procedure built upon their results, with practical innovations, that is fast and able to scale to large networks. More specifically, we evaluate the likelihood via simple function approximation of the corresponding ERGM’s graph limit and iteratively maximize the likelihood to obtain the MLE. We also discuss methods for conducting likelihood ratio tests for ERGMs as well as related issues. Through simulation studies and real data analysis of two large social networks, we show that our new method outperforms the MCMC-based method, especially when the network size is large (more than 100 nodes).
One limitation of our approach, inherited from the limitation of the result of Chatterjee et al. (Ann Stat 41:2428–2461, 2013), is that it works only for sequences of graphs with a positive limiting density, i.e., dense graphs.Statisticsrh2528, tz33StatisticsArticlesSurveying Hard-to-Reach Groups Through Sampled Respondents in a Social Network
https://academiccommons.columbia.edu/catalog/ac:185373
McCormick, Tyler H.; Zheng, Tian; He, Ran; Kolaczyk, Erichttp://dx.doi.org/10.7916/D8Z0372NTue, 31 Mar 2015 12:36:09 +0000The sampling frame in most social science surveys misses members of certain groups, such as the homeless or individuals living with HIV. These groups are known as hard-to-reach groups. One strategy for learning about these groups, or subpopulations, involves reaching hard-to-reach group members through their social network. In this paper we compare the efficiency of two common methods for subpopulation size estimation using data from standard surveys. These designs are examples of mental link tracing designs. These designs begin with a randomly sampled set of network members (nodes) and then reach other nodes indirectly through questions asked to the sampled nodes. Mental link tracing designs cost significantly less than traditional link tracing designs, yet introduce additional sources of potential bias. We examine the influence of one such source of bias using simulation studies. We then demonstrate our findings using data from the General Social Survey collected in 2004 and 2006. Additionally, we provide survey design suggestions for future surveys incorporating such designs.Statistics, Social researchtz33, rh2528StatisticsArticlesA Practical Guide to Measuring Social Structure Using Indirectly Observed Network Data
https://academiccommons.columbia.edu/catalog/ac:185370
McCormick, Tyler H.; Moussa, Amal; DiPrete, Thomas A.; Ruf, Johannes; Gelman, Andrew E.; Teitler, Julien O.; Zheng, Tianhttp://dx.doi.org/10.7916/D86H4G9DTue, 31 Mar 2015 12:16:05 +0000Aggregated relational data (ARD) are an increasingly common tool for learning about social networks through standard surveys. Recent statistical advances present social scientists with new options for analyzing such data. In this article, we propose guidelines for learning about various network processes using ARD and a template to aid practitioners. We first propose that ARD can be used to measure “social distance” between a respondent and a subpopulation (individuals named Kevin, those in prison, or those serving in the military). We then present common methods for analyzing these data and associate each of these methods with a specific way of measuring social distance, thus associating statistical tools with their underlying social science phenomena. We examine the implications of using each of these social distance measures using an Internet survey about contemporary political issues.Statistics, Social researchtad61, ag389, jot8, tz33Sociology, Statistics, Social WorkArticlesHow many people do you know?: Efficiently estimating personal network size
https://academiccommons.columbia.edu/catalog/ac:185367
Zheng, Tian; Salganik, Matthew J.; McCormick, Tyler H.http://dx.doi.org/10.7916/D8FX78BTTue, 31 Mar 2015 12:04:01 +0000In this paper we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. and other previous attempts to estimate individual network size, we propose a latent non-random mixing model which resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names to be asked about are chosen properly, the simple scale-up degree estimates can achieve the same bias reduction as our more complex latent non-random mixing model.Statistics, Social researchtz33StatisticsArticlesHow Many People Do You Know in Prison? Using Overdispersion in Count Data to Estimate Social Structure in Networks
https://academiccommons.columbia.edu/catalog/ac:185364
Zheng, Tian; Salganik, Matthew J.; Gelman, Andrew E.http://dx.doi.org/10.7916/D800011WMon, 30 Mar 2015 13:54:20 +0000Networks—sets of objects connected by relationships—are important in a number of fields. The study of networks has long been central to sociology, where researchers have attempted to understand the causes and consequences of the structure of relationships in large groups of people. Using insight from previous network research, Killworth et al. and McCarty et al. have developed and evaluated a method for estimating the sizes of hard-to-count populations using network data collected from a simple random sample of Americans. In this article we show how, using a multilevel overdispersed Poisson regression model, these data also can be used to estimate aspects of social structure in the population. Our work goes beyond most previous research on networks by using variation, as well as average responses, as a source of information. We apply our method to the data of McCarty et al. and find that Americans vary greatly in their number of acquaintances. Further, Americans show great variation in propensity to form ties to people in some groups (e.g., males in prison, the homeless, and American Indians), but little variation for other groups (e.g., twins, people named Michael or Nicole). We also explore other features of these data and consider ways in which survey data can be used to estimate network structure.Statistics, Social researchtz33, ag389Statistics, Political ScienceArticlesBackward Haplotype Transmission Association (BHTA) Algorithm-A Fast Multiple-Marker Screening Method
https://academiccommons.columbia.edu/catalog/ac:185361
Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D87D2T2XMon, 30 Mar 2015 13:36:26 +0000The mapping of complex traits is one of the most important and central areas of human genetics today. Recent attention has been focused on genome scans using a large number of marker loci. Because complex traits are typically caused by multiple genes, the common approaches of mapping them by testing markers one after another fail to capture the substantial information of interactions among disease loci. Here we propose a backward haplotype transmission association (BHTA) algorithm to address this problem. The algorithm can administer a screening on any disease model when case-parent trio data are available. It identifies the important subset of an original larger marker set by eliminating the markers of least significance, one at a time, after a complete evaluation of their importance. In contrast with the existing methods, three major advantages emerge from this approach. First, it can be applied flexibly to arbitrary markers, regardless of their locations. Second, it takes into account haplotype information; it is more powerful in detecting the multifactorial traits in the presence of haplotypic association. Finally, the proposed method can potentially prove to be more efficient in future genome-wide scans, in terms of greater accuracy of gene detection and a substantially reduced number of tests required in scans. We illustrate the performance of the algorithm with several examples, including one real data set with 31 markers for a study on the Gilles de la Tourette syndrome. Detailed theoretical justifications are also included, which explain why the algorithm is likely to select the ‘correct’ markers.Biostatistics, Geneticsshl5, tz33Statistics, BiostatisticsArticlesBackward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs
https://academiccommons.columbia.edu/catalog/ac:185325
Zheng, Tian; Wang, Hui; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8SF2V33Mon, 30 Mar 2015 12:12:55 +0000Background: The studies of complex traits project new challenges to current methods that evaluate association between genotypes and a specific trait. Consideration of possible interactions among loci leads to overwhelming dimensions that cannot be handled using current statistical methods. Methods: In this article, we evaluate a multi-marker screening algorithm--the backward genotype-trait association (BGTA) algorithm for case-control designs, which uses unphased multi-locus genotypes. BGTA carries out a global investigation on a candidate marker set and automatically screens out markers carrying diminutive amounts of information regarding the trait in question. To address the "too many possible genotypes, too few informative chromosomes" dilemma of a genomic-scale study that consists of hundreds to thousands of markers, we further investigate a BGTA-based marker selection procedure, in which the screening algorithm is repeated on a large number of random marker subsets. Results of these screenings are then aggregated into counts of how often each marker is retained by the BGTA algorithm. Markers with exceptionally high return counts are selected for further analysis. Results and Conclusion: Evaluated using simulations under several disease models, the proposed methods prove to be more powerful in dealing with epistatic traits. We also demonstrate the proposed methods through an application to a study on inflammatory bowel disease.Statistics, Genetics, Biostatisticstz33, hw2334, shl5Statistics, Microbiology and Immunology, BiostatisticsArticlesDiscovering interactions among BRCA1 and other candidate genes associated with sporadic breast cancer
https://academiccommons.columbia.edu/catalog/ac:184992
Lo, Shaw-Hwa; Chernoff, Herman; Cong, Lei; Ding, Yuejing; Zheng, Tianhttp://dx.doi.org/10.7916/D8CC0ZKFSat, 28 Mar 2015 15:43:57 +0000Analysis of a subset of case-control sporadic breast cancer data, [from the National Cancer Institute's Cancer Genetic Markers of Susceptibility (CGEMS) initiative], focusing on 18 breast cancer-related genes with 304 SNPs, indicates that there are many interesting interactions that form two- and three-way networks in which BRCA1 plays a dominant and central role. The apparent interactions of BRCA1 with many other genes suggests the conjecture that BRCA1 serves as a protective gene and that some mutations in it or in related genes may prevent it from carrying out this protective function even if the patients are not carriers of known cancer-predisposing BRCA1 mutations. The method of analysis features the evaluation of the effect of a gene by averaging the effects of the SNPs covered by that gene. Marginal methods that test one gene at a time fail to show any effect. That may be related to the fact that each of these 18 genes adds very little to the risk of cancer. Analysis that relates the ratio of interactions to the maximum of the first-order effects discovers significant gene pairs and triplets.
Breast cancer (MIM 114480) has complex causes. Known predisposition genes explain <15% of the breast cancer cases. It is generally believed that most sporadic breast cancers are triggered by unknown combined effects, possibly because of a large number of genes and other risk factors, each adding a small risk toward cancer etiology. Progress in seeking breast cancer genes other than BRCA1 and BRCA2 has been slow and limited because the individual risk due to each gene is small. This difficulty may be partly due to the fact that current methods rely largely on marginal information from genes studied one at a time and ignore potentially valuable information because of the interaction among multiple loci. Because each responsible gene may have a small marginal effect in causing disease, it is likely that such methods will fail to capture many responsible genes by studying a dataset where the disease may be due to a variety of different sources. The possible presence of many genes responsible for different subgroups of cancer patients may reduce the power of current methods to detect genes partly responsible for some forms of breast cancer. It is believed that methods effective in extracting interactive information from data should be developed.
What should be done when marginal effects are too weak to be detected? Our methods use interactive information from multiple sites as well as marginal information. They provide power to detect interactive genes. To test this claim and to demonstrate the practical value of these methods in real applications, we apply them to an important study: a subset of a large dataset collected from a case-control sporadic breast cancer study, focusing on gene–gene-based analysis. This partial dataset comprises 18 genes with 304 SNP markers. The application results in a number of scientific findings.
The message of this article is fourfold. First, if marginal methods fail, more powerful methods that take into account interactive information can be used effectively. We apply our proposed methods to this dataset to illustrate the detection of the interactions between genes. We point out that in our findings, none of the 18 selected genes show any detectable marginal effects that are significantly higher than those generated by random fluctuations. In other words, all of the 18 genes would be missed if only marginal methods were used.
Second, we demonstrate how to carry out a gene-based analysis by treating each gene as a basic unit while incorporating relevant information from all SNPs within that gene. Two summary test scores are proposed to quantify the strength of interactions for each pair of genes. The pairwise interactions can be extended easily. We also provide results using third-order interactions.
Third, to establish statistical significance, we generate a large number of permutations of the dependent variable (case or control) to see how the measures of interaction for the real data compare with those from the many permutations.
Finally, when these procedures are applied to the data, they lead to a number of interesting findings. It is shown that there are a substantial number of significant interactions that form a network in which BRCA1 plays a dominant role. The interactions of BRCA1 with many of the other genes suggests the conjecture that BRCA1 serves as a protective gene and that some mutations in it or in related genes may prevent it from carrying out the protective function.Biostatistics, Geneticsshl5, tz33StatisticsArticlesProbing genetic overlap among complex human phenotypes
https://academiccommons.columbia.edu/catalog/ac:184989
Rzhetsky, Andrey; Wajngurt, David; Park, Naeun; Zheng, Tianhttp://dx.doi.org/10.7916/D8MS3RPRSat, 28 Mar 2015 15:35:51 +0000Geneticists and epidemiologists often observe that certain hereditary disorders cooccur in individual patients significantly more (or significantly less) frequently than expected, suggesting there is a genetic variation that predisposes its bearer to multiple disorders, or that protects against some disorders while predisposing to others. We suggest that, by using a large number of phenotypic observations about multiple disorders and an appropriate statistical model, we can infer genetic overlaps between phenotypes. Our proof-of-concept analysis of 1.5 million patient records and 161 disorders indicates that disease phenotypes form a highly connected network of strong pairwise correlations. Our modeling approach, under appropriate assumptions, allows us to estimate from these correlations the size of putative genetic overlaps. For example, we suggest that autism, bipolar disorder, and schizophrenia share significant genetic overlaps. Our disease network hypothesis can be immediately exploited in the design of genetic mapping approaches that involve joint linkage or association analyses of multiple seemingly disparate phenotypes.Biostatistics, Geneticstz33StatisticsArticlesA demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data
https://academiccommons.columbia.edu/catalog/ac:184986
Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8W95829Sat, 28 Mar 2015 15:25:27 +0000We test the backward haplotype transmission association algorithm on genome-scan data previously studied by Rioux et al. [Rioux, J. D., et al. (2000) Am. J. Hum. Genet. 66, 1863–1870]. In their study, multipoint linkage methods were applied to affected sib-pairs with inflammatory bowel disease, and significant linkage evidence points to two susceptibility loci. After we apply our approach to these data with a global search accounting for both joint and marginal effects, very interesting results emerge, many of them intriguing. These results provide compelling support for the application of our approach to other data wherever applicable. Results from this project also make it clear that it is important to reinvestigate available family-based datasets that can be suitably reanalyzed. Given previously collected data in the literature, our approach, with its increased efficiency in using available resources, draws additional crucial information that may lead to novel and surprising results.Biostatistics, Geneticsshl5, tz33StatisticsArticlesComment: Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies
https://academiccommons.columbia.edu/catalog/ac:184983
Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D84T6H8MSat, 28 Mar 2015 15:10:24 +0000The authors suggest an interesting way to measure the fraction of missing information in the context of hypothesis testing. The measure seeks to quantify the impact of missing observations on the test between two hypotheses. The amount of impact can be useful information for applied research. One example arises in genetics, where multiple tests of the same sort are performed on different variables with different missing rates, and follow-up studies may be designed to resolve missing values in selected variables.
In this discussion, we offer our prospective views on the use of relative information in a follow-up study. For studies where the impact of missing observations varies greatly across different variables and where the investigators have the flexibility to design studies that devote different amounts of effort to different variables, an optimal design may be derived using relative information measures to improve the cost-effectiveness of the follow-up.Statisticstz33, shl5StatisticsArticlesBayesian hierarchical graph-structured model for pathway analysis using gene expression data
https://academiccommons.columbia.edu/catalog/ac:184980
Zhou, Hui; Zheng, Tianhttp://dx.doi.org/10.7916/D8DB80QNSat, 28 Mar 2015 14:46:49 +0000In genomic analysis, there is growing interest in network structures that represent biochemistry interactions. Graph structured or constrained inference takes advantage of a known relational structure among variables to introduce smoothness and reduce complexity in modeling, especially for high-dimensional genomic data. There has been a lot of interest in its application in model regularization and selection. However, prior knowledge on the graphical structure among the variables can be limited and partial. Empirical data may suggest variations and modifications to such a graph, which could lead to new and interesting biological findings. In this paper, we propose a Bayesian random graph-constrained model, rGrace, an extension from the Grace model, to combine a priori network information with empirical evidence, for applications such as pathway analysis. Using both simulations and real data examples, we show that the new method, while leading to improved predictive performance, can identify discrepancy between data and a prior known graph structure and suggest modifications and updates.Biostatistics, Geneticstz33StatisticsArticlesOn Bootstrap Tests of Symmetry About an Unknown Median
https://academiccommons.columbia.edu/catalog/ac:184965
Zheng, Tian; Gastwirth, Joseph L.http://dx.doi.org/10.7916/D8X9296PFri, 27 Mar 2015 16:05:28 +0000It is important to examine the symmetry of an underlying distribution before applying some statistical procedures to a data set. For example, in the Zuni School District case, a formula originally developed by the Department of Education trimmed 5% of the data symmetrically from each end. The validity of this procedure was questioned at the hearing by Chief Justice Roberts. Most tests of symmetry (even nonparametric ones) are not distribution free in finite sample sizes. Hence, relying on the asymptotic distribution may yield an inaccurate type I error rate and/or a loss of power in small samples. Bootstrap resampling from a symmetric empirical distribution function fitted to the data is proposed to improve the accuracy of the calculated p-value of several tests of symmetry. The results show that the bootstrap method is superior to previously used approaches relying on the asymptotic distribution of the tests that assumed the data come from a normal distribution. Incorporating the bootstrap estimate in a recently proposed test due to Miao, Gel and Gastwirth (2006) preserves its level and shows that it has reasonable power properties on the family of distributions evaluated.Statisticstz33StatisticsArticlesGenetic-linkage mapping of complex hereditary disorders to a whole-genome molecular-interaction network
https://academiccommons.columbia.edu/catalog/ac:184959
Iossifov, Ivan; Zheng, Tian; Baron, Miron; Gilliam, T. Conrad; Rzhetsky, Andreyhttp://dx.doi.org/10.7916/D85T3JD0Fri, 27 Mar 2015 15:56:46 +0000Common hereditary neurodevelopmental disorders such as autism, bipolar disorder, and schizophrenia are most likely both genetically multifactorial and heterogeneous. Because of these characteristics traditional methods for genetic analysis fail when applied to such diseases. To address the problem we propose a novel probabilistic framework that combines the standard genetic linkage formalism with whole-genome molecular-interaction data to predict pathways or networks of interacting genes that contribute to common heritable disorders. We apply the model to three large genotype–phenotype data sets, identify a small number of significant candidate genes for autism (24), bipolar disorder (21), and schizophrenia (25), and predict a number of gene targets likely to be shared among the disorders.Biostatistics, Geneticstz33StatisticsArticlesLatent demographic profile estimation in hard-to-reach groups
https://academiccommons.columbia.edu/catalog/ac:184956
McCormick, Tyler H.; Zheng, Tianhttp://dx.doi.org/10.7916/D8F76BFQFri, 27 Mar 2015 15:49:06 +0000The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using Aggregated relational data (ARD), or questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28–39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.Statisticstz33StatisticsArticlesDiscovering influential variables: A method of partitions
https://academiccommons.columbia.edu/catalog/ac:184953
Chernoff, Herman; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8PR7TVMFri, 27 Mar 2015 15:41:19 +0000A trend in all scientific disciplines, based on advances in technology, is the increasing availability of high-dimensional data in which important information is buried. A current urgent challenge to statisticians is to develop effective methods of finding the useful information in the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer-intensive approach, based on a method pioneered by Lo and Zheng, for detecting which of many potential explanatory variables have an influence on a dependent variable Y. This approach is suited to detecting influential variables where causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis, involving possibly thousands of variables, by dealing with many randomly selected small subsets from which smaller subsets are selected, guided by a measure of influence I. The main objective is to discover the influential variables, rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be amenable to appropriate analysis. In a sense, we are confining our attention to locating a few needles in a haystack.Statistics, Computer scienceshl5, tz33StatisticsArticlesConstructing gene association networks for rheumatoid arthritis using the backward genotype-trait association (BGTA) algorithm
https://academiccommons.columbia.edu/catalog/ac:184950
Ding, Yuejing; Cong, Lei; Ionita-Laza, Iuliana; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8Z89B92Fri, 27 Mar 2015 15:31:50 +0000Rheumatoid arthritis (RA, MIM 180300) is a common and complex inflammatory disorder. The North American Rheumatoid Arthritis Consortium (NARAC) data, as part of the Genetic Analysis Workshop 15 data, consists of both genome scan and candidate gene studies on RA patients.
We applied the backward genotype-trait association (BGTA) algorithm to capture marginal and gene × gene interaction effects of multiple susceptibility loci on RA disease status. A two-stage screening approach was used for the genome scan, whereas a comprehensive study of all possible subsets was conducted for the candidate genes. For the genome scan, we constructed an association network among 39 genetic loci that demonstrated strong signals, 19 of which have been reported in the RA literature. For the candidate genes, we found strong signals for PTPN22 and SUMO4. Based on significant association evidence, we built an association network among the loci of PTPN22, PADI4, DLG5, SLC22A4, SUMO4, and CARD15. To control for false positives, we used permutation tests to constrain the family-wise type I error rate to 1%.
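The permutation-based control of the family-wise type I error rate mentioned above can be sketched generically. The example below is not the BGTA procedure: it assumes a simple stand-in statistic (the absolute difference in mean genotype between cases and controls at each locus) and uses the maximum statistic over loci across label permutations to set a family-wise threshold. The function names and the toy data are hypothetical.

```python
import random

def mean_diff_stats(genotypes, status):
    """Per-locus |mean(cases) - mean(controls)|; a stand-in statistic, not BGTA."""
    cases = [g for g, s in zip(genotypes, status) if s == 1]
    controls = [g for g, s in zip(genotypes, status) if s == 0]
    stats = []
    for j in range(len(genotypes[0])):
        case_mean = sum(row[j] for row in cases) / len(cases)
        ctrl_mean = sum(row[j] for row in controls) / len(controls)
        stats.append(abs(case_mean - ctrl_mean))
    return stats

def max_stat_threshold(genotypes, status, n_perm=1000, alpha=0.01, seed=0):
    """Threshold exceeded by at most an alpha fraction of permutation maxima."""
    rng = random.Random(seed)
    labels = list(status)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(labels)                      # break any genotype-status link
        maxima.append(max(mean_diff_stats(genotypes, labels)))
    maxima.sort()
    k = max(0, n_perm - 1 - int(alpha * n_perm))
    return maxima[k]

# Toy data (hypothetical): 20 cases then 20 controls, 3 loci.
# Locus 0 tracks case status; loci 1 and 2 are unrelated to it.
status = [1] * 20 + [0] * 20
genotypes = [[2 * s, i % 2, (i // 2) % 2] for i, s in enumerate(status)]

observed = mean_diff_stats(genotypes, status)
threshold = max_stat_threshold(genotypes, status)
significant = [j for j, stat in enumerate(observed) if stat > threshold]
```

Comparing every locus against the same maximum-based threshold is what makes the error control family-wise rather than per-test.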
Using the BGTA algorithm, we identified genetic loci and candidate genes that were associated with RA susceptibility and association networks among them. For the first time, we report possible interactions between single-nucleotide polymorphisms/genes, which may be useful for biological interpretation.Genetics, Biostatisticsii2135, shl5, tz33Biostatistics, StatisticsArticlesJoint study of genetic regulators for expression traits related to breast cancer
https://academiccommons.columbia.edu/catalog/ac:184947
Zheng, Tian; Wang, Shuang; Cong, Lei; Ding, Yuejing; Ionita-Laza, Iuliana; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D86T0KHXFri, 27 Mar 2015 15:24:12 +0000The mRNA expression levels of genes have been shown to have discriminating power for the classification of breast cancer. Studying the heritability of gene expression levels on breast cancer related transcripts can lead to the identification of shared common regulators and inter-regulation patterns, which would be important for dissecting the etiology of breast cancer.
We applied multilocus association genome-wide scans to 18 breast cancer related transcripts and combined the results with traditional linkage scans. Regulatory hotspots for these transcripts were identified and some inter-regulation patterns were observed. We also derived evidence on interacting genetic regulatory loci shared by a number of these transcripts.
In this paper, by restricting to a set of related genes, we were able to employ a more detailed multilocus approach that evaluates both marginal and interaction association signals at each single-nucleotide polymorphism. Interesting inter-regulation patterns and significant overlaps of genetic regulators between transcripts were observed. Interaction association results returned more expression quantitative trait locus hotspots that are significant.Genetics, Biostatisticstz33, sw2206, ii2135, shl5Statistics, BiostatisticsArticlesTranscription activity hot spot, is it real or an artifact?
https://academiccommons.columbia.edu/catalog/ac:184944
Wang, Shuang; Zheng, Tian; Wang, Yuanjiahttp://dx.doi.org/10.7916/D808647VFri, 27 Mar 2015 15:03:53 +0000Transcription activity 'hot spots', defined as chromosome regions that contain more expression quantitative trait loci than would have been expected by chance, have been frequently detected both in humans and in model organisms. It has been common to consider the existence of hot spots as evidence for master regulation of gene expression. However, hot spots could also simply be due to highly correlated gene expressions or linkage disequilibrium and do not truly represent master regulators. A recent simulation study using real human gene expression data but simulated random single-nucleotide polymorphism genotypes showed patterns of clustering of expression quantitative trait loci that resemble those in actual studies [Perez-Enciso: Genetics 2004, 166: 547–554.]. In this study, to assess the credibility of transcription activity hot spots, we conducted genetic analyses on gene expressions provided by Genetic Analysis Workshop 15 Problem 1.Genetics, Biostatisticssw2206, tz33, yw2016Biostatistics, StatisticsArticlesPattern-based mining strategy to detect multi-locus association and gene × environment interaction
https://academiccommons.columbia.edu/catalog/ac:184941
Li, Zhong; Zheng, Tian; Califano, Andrea; Floratos, Aristidishttp://dx.doi.org/10.7916/D8H70DQGFri, 27 Mar 2015 14:54:47 +0000As genome-wide association studies grow in popularity for the identification of genetic factors for common and rare diseases, analytical methods to comb through large numbers of genetic variants efficiently to identify disease association are increasingly in demand. We have developed a pattern-based data-mining approach to discover unlinked multilocus genetic effects for complex disease and to detect genotype × phenotype/genotype × environment interactions. On a densely mapped chromosome 18 data set for rheumatoid arthritis that was made available by Genetic Analysis Workshop 15, this method detected two potential two-locus associations as well as a putative two-locus gene × gender interaction.Genetics, Biostatisticszl2147, tz33, ac2248, af2202Systems Biology, Statistics, Biomedical InformaticsArticlesIdentification of gene interactions associated with disease from gene expression data using synergy networks
https://academiccommons.columbia.edu/catalog/ac:184938
Watkinson, John; Wang, Xiaodong; Zheng, Tian; Anastassiou, Dimitrishttp://dx.doi.org/10.7916/D81835DPFri, 27 Mar 2015 14:47:01 +0000Analysis of microarray data has been used for the inference of gene-gene interactions. If, however, the aim is the discovery of disease-related biological mechanisms, then the criterion for defining such interactions must be specifically linked to disease.
Here we present a computational methodology that jointly analyzes two sets of microarray data, one in the presence and one in the absence of a disease, identifying gene pairs whose correlation with disease is due to cooperative, rather than independent, contributions of genes, using the recently developed information theoretic measure of synergy. High levels of synergy in gene pairs indicate possible membership of the two genes in a shared pathway and lead to a graphical representation of inferred gene-gene interactions associated with disease, in the form of a "synergy network." We apply this technique to a set of publicly available prostate cancer expression data and successfully validate our results, confirming that they cannot be due to pure chance and providing a biological explanation for gene pairs with exceptionally high synergy.
Thus, synergy networks provide a computational methodology helpful for deriving "disease interactomes" from biological data. When coupled with additional biological knowledge, they can also be helpful for deciphering biological mechanisms responsible for disease.Genetics, Biostatisticsxw2008, tz33, da8Electrical Engineering, StatisticsArticlesRheumatoid arthritis-associated gene-gene interaction network for rheumatoid arthritis candidate genes
https://academiccommons.columbia.edu/catalog/ac:184935
Huang, Chien-Hsun; Cong, Lei; Xie, Jun; Qiao, Bo; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8J67FTVFri, 27 Mar 2015 14:37:18 +0000Rheumatoid arthritis (RA, MIM 180300) is a chronic and complex autoimmune disease. Using the North American Rheumatoid Arthritis Consortium (NARAC) data set provided in Genetic Analysis Workshop 16 (GAW16), we used the genotype-trait distortion (GTD) scores and proposed analysis procedures to capture the gene-gene interaction effects of multiple susceptibility gene regions on RA. In this paper, we focused on 27 RA candidate gene regions (531 SNPs) based on a literature search. Statistical significance was evaluated using 1000 permutations. HLADRB1 was found to have strong marginal association with RA. We identified 14 significant interactions (p < 0.01), which were aggregated into an association network among 12 selected candidate genes PADI4, FCGR3, TNFRSF1B, ITGAV, BTLA, SLC22A4, IL3, VEGF, TNF, NFKBIL1, TRAF1-C5, and MIF. Based on our and other contributors' findings during the GAW16 conference, we further studied 24 candidate regions with 336 SNPs. We found 23 significant interactions (p-value < 0.01), nine interactions in addition to our initial findings, and the association network was extended to include candidate genes HLA-A, HLA-B, HLA-C, CTLA4, and IL6. As we will discuss in this paper, the reported possible interactions between genes may suggest potential biological activities of RA.Genetics, Biostatisticsshl5, tz33StatisticsArticlesGenome-wide gene-based analysis of rheumatoid arthritis-associated interaction with PTPN22 and HLA-DRB1
https://academiccommons.columbia.edu/catalog/ac:184932
Qiao, Bo; Huang, Chien-Hsun; Cong, Lei; Xie, Jun; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8SQ8Z92Fri, 27 Mar 2015 14:27:30 +0000The genes PTPN22 and HLA-DRB1 have been found by a number of studies to confer an increased risk for rheumatoid arthritis (RA), which indicates that both genes play an important role in RA etiology. It is believed that they not only have strong association with RA individually, but also interact with other related genes that have not been found to have predisposing RA mutations. In this paper, we conduct genome-wide searches for RA-associated gene-gene interactions that involve PTPN22 or HLA-DRB1 using the Genetic Analysis Workshop 16 Problem 1 data from the North American Rheumatoid Arthritis Consortium. MGC13017, HSPCAL3, MIA, PTPNS1L, and IGLVI-70, which showed association with RA in previous studies, have been confirmed in our analysis.Genetics, Biostatisticsshl5, tz33StatisticsArticlesIdentifying rare disease variants in the Genetic Analysis Workshop 17 simulated data: a comparison of several statistical approaches
https://academiccommons.columbia.edu/catalog/ac:184928
Fan, Ruixue; Huang, Chien-Hsun; Lo, Shaw-Hwa; Zheng, Tian; Ionita-Laza, Iulianahttp://dx.doi.org/10.7916/D89P30J1Fri, 27 Mar 2015 14:16:59 +0000Genome-wide association studies have been successful at identifying common disease variants associated with complex diseases, but the common variants identified have small effect sizes and account for only a small fraction of the estimated heritability for common diseases. Theoretical and empirical studies suggest that rare variants, which are much less frequent in populations and are poorly captured by single-nucleotide polymorphism chips, could play a significant role in complex diseases. Several new statistical methods have been developed for the analysis of rare variants, for example, the combined multivariate and collapsing method, the weighted-sum method and a replication-based method. Here, we apply and compare these methods to the simulated data sets of Genetic Analysis Workshop 17 and thereby explore the contribution of rare variants to disease risk. In addition, we investigate the usefulness of extreme phenotypes in identifying rare risk variants when dealing with quantitative traits. Finally, we perform a pathway analysis and show the importance of the vascular endothelial growth factor pathway in explaining different phenotypes.Genetics, Biostatisticsrf2283, shl5, tz33, ii2135Statistics, BiostatisticsArticlesNew insights into old methods for identifying causal rare variants
https://academiccommons.columbia.edu/catalog/ac:184925
Wang, Haitian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Zheng, Tian; Hu, Inchihttp://dx.doi.org/10.7916/D8K64H03Fri, 27 Mar 2015 14:09:21 +0000The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.Genetics, Biostatisticsshl5, tz33StatisticsArticlesAssociation screening for genes with multiple potentially rare variants: an inverse-probability weighted clustering approach
https://academiccommons.columbia.edu/catalog/ac:184921
Liu, Ying; Huang, Chien-Hsun; Hu, Inchi; Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8BP01QVFri, 27 Mar 2015 13:51:07 +0000Both common variants and rare variants are involved in the etiology of most complex diseases in humans. Developments in sequencing technology have led to the identification of a high density of rare variant single-nucleotide polymorphisms (SNPs) on the genome, each of which affects only at most 1% of the population. Genotypes derived from these SNPs allow one to study the involvement of rare variants in common human disorders. Here, we propose an association screening approach that treats genes as units of analysis. SNPs within a gene are used to create partitions of individuals, and inverse-probability weighting is used to overweight genotypic differences observed on rare variants. Association between a phenotype trait and the constructed partition is then evaluated. We consider three association tests (one-way ANOVA, chi-square test, and the partition retention method) and compare these strategies using the simulated data from the Genetic Analysis Workshop 17. Several genes that contain causal SNPs were identified by the proposed method as top genes.Genetics, Biostatisticsyl2802, tz33, shl5Biostatistics, StatisticsArticlesIdentifying influential regions in extremely rare variants using a fixed-bin approach
https://academiccommons.columbia.edu/catalog/ac:184917
Agne, Michael; Huang, Chien-Hsun; Hu, Inchi; Wang, Haitian; Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8VM4B5WFri, 27 Mar 2015 13:44:41 +0000In this study, we analyze the Genetic Analysis Workshop 17 data to identify regions of single-nucleotide polymorphisms (SNPs) that exhibit a significant influence on response rate (proportion of subjects with an affirmative affected status), called the affected ratio, among rare variants. Under the null hypothesis, the distribution of rare variants is assumed to be uniform over case (affected) and control (unaffected) subjects. We attempt to pinpoint regions where the composition is significantly different between case and control events, specifically where there are unusually high numbers of rare variants among affected subjects. We focus on private variants, which require a degree of “collapsing” to combine information over several SNPs, to obtain meaningful results. Instead of implementing a gene-based approach, where regions would vary in size and sometimes be too small to achieve a strong enough signal, we implement a fixed-bin approach, with a preset number of SNPs per region, relying on the assumption that proximity and similarity go hand in hand. Through application of 100-SNP and 30-SNP fixed bins, we identify several most influential regions, which later are seen to contain some of the causal SNPs. The 100- and 30-SNP approaches detected seven and three causal SNPs among the most significant regions, respectively, with two overlapping SNPs located in the ELAVL4 gene, reported by both procedures.Genetics, Biostatisticsmra2110, tz33, shl5StatisticsArticlesConsidering interactive effects in the identification of influential regions with extremely rare variants via fixed bin approach
https://academiccommons.columbia.edu/catalog/ac:184914
Agne, Michael; Huang, Chien-Hsun; Hu, Inchi; Wang, Haitian; Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8445KCHFri, 27 Mar 2015 13:37:48 +0000In this study, we analyze the Genetic Analysis Workshop 18 (GAW18) data to identify regions of single-nucleotide polymorphisms (SNPs), which significantly influence hypertension status among individuals. We have studied the marginal impact of these regions on disease status in the past, but we extend the method to deal with environmental factors present in data collected over several exam periods. We consider the respective interactions between such traits as smoking status and age with the genetic information and hope to augment those genetic regions deemed influential marginally with those that contribute via an interactive effect. In particular, we focus only on rare variants and apply a procedure to combine signal among rare variants in a number of "fixed bins" along the chromosome. We extend the procedure in Agne et al to incorporate environmental factors by dichotomizing subjects via traits such as smoking status and age, running the marginal procedure among each respective category (i.e., smokers or nonsmokers), and then combining their scores into a score for interaction. To avoid overlap of subjects, we examine each exam period individually. Out of a possible 629 fixed-bin regions in chromosome 3, we observe that 11 show up in multiple exam periods for gene-smoking score. Fifteen regions exhibit significance for multiple exam periods for gene-age score, with 4 regions deemed significant for all 3 exam periods. The procedure pinpoints SNPs in 8 "answer" genes, with 5 of these showing up as significant in multiple testing schemes (Gene-Smoking, Gene-Age for Exams 1, 2, and 3).Genetics, Biostatisticsmra2110, tz33, shl5StatisticsArticlesA dual-clustering framework for association screening with whole genome sequencing data and longitudinal traits
https://academiccommons.columbia.edu/catalog/ac:184911
Lui, Ying; Huang, Chien-Hsun; Hu, Inchi; Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8N29VVKFri, 27 Mar 2015 13:31:58 +0000Current sequencing technology enables generation of whole genome sequencing data sets that contain a high density of rare variants, each of which is carried by, at most, 5% of the sampled subjects. Such variants are involved in the etiology of most common diseases in humans. These diseases can be studied by relevant longitudinal phenotype traits. Tests for association between such genotype information and longitudinal traits allow the study of the function of rare variants in complex human disorders. In this paper, we propose an association-screening framework that highlights the genotypic differences observed on rare variants and the longitudinal nature of phenotypes. In particular, both variants within a gene and longitudinal phenotypes are used to create partitions of subjects. Association between the 2 sets of constructed partitions is then evaluated. We apply the proposed strategy to the simulated data from the Genetic Analysis Workshop 18 and compare the obtained results with those from sequence kernel association test using the receiver operating characteristic curves.Genetics, Biostatisticstz33, shl5StatisticsArticlesA partition-based approach to identify gene-environment interactions in genome wide association studies
https://academiccommons.columbia.edu/catalog/ac:184908
Fan, Ruixue; Huang, Chien-Hsun; Hu, Inchi; Wang, Haitan; Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8542MGFFri, 27 Mar 2015 13:10:04 +0000It is believed that almost all common diseases are the consequence of complex interactions between genetic markers and environmental factors. However, few such interactions have been documented to date. Conventional statistical methods for detecting gene and environmental interactions are often based on the linear regression model, which assumes a linear interaction effect. In this study, we propose a nonparametric partition-based approach that is able to capture complex interaction patterns. We apply this method to the real data set of hypertension provided by Genetic Analysis Workshop 18. Compared with the linear regression model, the proposed approach is able to identify many additional variants with significant gene-environmental interaction effects. We further investigate one single-nucleotide polymorphism identified by our method and show that its gene-environmental interaction effect is, indeed, nonlinear. To adjust for the family dependence of phenotypes, we apply different permutation strategies and investigate their effects on the outcomes.Genetics, Biostatisticsrf2283, tz33, shl5StatisticsArticlesDiscovering pure gene-environment interactions in blood pressure genome-wide association studies data: a two-step approach incorporating new statistics
https://academiccommons.columbia.edu/catalog/ac:184905
Wang, Maggie Haitan; Huang, Chien-Hsun; Zheng, Tian; Lo, Shaw-Hwa; Hu, Inchihttp://dx.doi.org/10.7916/D8DN43X5Fri, 27 Mar 2015 12:57:34 +0000Environment has long been known to play an important part in disease etiology. However, not many genome-wide association studies take environmental factors into consideration. There is also a need for new methods to identify the gene-environment interactions. In this study, we propose a 2-step approach incorporating an influence measure that captures pure gene-environment effect. We found that pure gene-age interaction has a stronger association than considering the genetic effect alone for systolic blood pressure, measured by counting the number of single-nucleotide polymorphisms (SNPs) reaching a certain significance level. We analyzed the subjects by dividing them into two age groups and found no overlap in the top identified SNPs between them. This suggested that age might have a nonlinear effect on genetic association. Furthermore, the scores of the top SNPs for the two age subgroups were about 3 times those obtained when using all subjects for systolic blood pressure. In addition, the scores of the older age subgroup were much higher than those for the younger group. The results suggest that genetic effects are stronger in older age and that genetic association studies should take environmental effects into consideration, especially age.Genetics, Biostatisticstz33, shl5StatisticsArticlesSelecting informative genes for discriminant analysis using multigene expression profiles.
https://academiccommons.columbia.edu/catalog/ac:184902
Yan, Xin; Zheng, Tianhttp://dx.doi.org/10.7916/D8XK8DF3Fri, 27 Mar 2015 12:49:30 +0000Gene expression data extracted from microarray experiments have been used to study the difference between mRNA abundance of genes under different conditions. In one such experiment, thousands of genes are measured simultaneously, which provides a high-dimensional feature space for discriminating between different sample classes. However, most of these dimensions are not informative about the between-class difference, and add noise to the discriminant analysis.
In this paper we propose and study feature selection methods that evaluate the "informativeness" of a set of genes. Two measures of information based on multigene expression profiles are considered for a backward information-driven screening approach for selecting important gene features. By considering multigene expression profiles, we are able to utilize interaction information among these genes. Using a breast cancer data set, we illustrate our methods and compare their performance with that of existing methods.
We illustrate in this paper that methods considering gene-gene interactions have better classification power in gene expression analysis. In our results, we identify important genes with relatively large p-values from single gene tests. This indicates that these are genes with weak marginal information but strong interaction information, which will be overlooked by strategies that only examine individual genes.Biostatistics, Geneticstz33StatisticsArticlesOn Identifying Rare Variants for Complex Human Traits
https://academiccommons.columbia.edu/catalog/ac:197118
Fan, Ruixuehttp://dx.doi.org/10.7916/D8N29VT4Mon, 16 Mar 2015 12:24:34 +0000This thesis focuses on developing novel statistical tests for rare variants association analysis incorporating both marginal effects and interaction effects among rare variants. Compared with common variants, rare variants have lower minor allele frequencies (typically less than 5%), and hence traditional association tests for common variants will lose power for rare variants. Therefore, there is a pressing need for new analytical tools to tackle the problem of rare variants association with complex human traits. Several collapsing methods have been proposed that aggregate information of rare variants in a region and test them together. They can be divided into burden tests and non-burden tests based on their aggregation strategies. They are all variations of regression-based methods with the assumption that the phenotype is associated with the genotype via a (linear) regression model. Most of these methods consider only marginal effects of rare variants and fail to take into account gene-gene and gene-environmental interactive effects, which are ubiquitous and are of utmost importance in biological systems. In this thesis, we propose a summation of partition approach (SPA) -- a nonparametric strategy for rare variants association analysis. Extensive simulation studies show that SPA is powerful in detecting not only marginal effects but also gene-gene interaction effects of rare variants. Moreover, extensions of SPA are able to detect gene-environment interactions and other interactions existing in complicated biological systems as well. We are also able to obtain the asymptotic behavior of the marginal SPA score, which guarantees the power of the proposed method. Inspired by the idea of stepwise variable selection, a significance-based backward dropping algorithm (SDA) is proposed to locate truly influential rare variants in a genetic region that has been identified as significant.
Unlike traditional backward dropping approaches which remove the least significant variables first, SDA introduces the idea of eliminating the most significant variable at each round. The removed variables are collected and their effects are evaluated by an influence ratio score -- the relative p-value change. Our simulation studies show that SDA is powerful to detect causal variables and SDA has lower false discovery rate than LASSO. We also demonstrate our method using the dataset provided by Genetic Analysis Workshop (GAW) 17 and the results support the superiority of SDA over LASSO. The general partition-retention framework can also be applied to detect gene-environmental interaction effects for common variants. We demonstrate this method using the dataset from Genetic Analysis Workshop (GAW) 18. Our nonparametric approach is able to identify a lot more possible influential gene-environmental pairs than traditional linear regression models. We propose in this thesis a "SPA-SDA" two step approach for rare variants association analysis at genomic scale: first identify significant regions of moderate sizes using SPA, and then apply SDA to the identified regions to pinpoint truly influential variables. This approach is computationally efficient for genomic data and it has the capacity to detect gene-gene and gene-environmental interactions.Statistics, Bioinformatics, Human genetics--Variation, Regression analysis, Genetics--Statistical methods, Genomics--Data Processingrf2283StatisticsDissertationsProtecting Minorities in Large Binary Elections: A Test of Storable Votes Using Field Data
https://academiccommons.columbia.edu/catalog/ac:182487
Casella, Alessandra M.; Gelman, Andrew E.; Ehrenberg, Shuky; Shen, Jiehttp://dx.doi.org/10.7916/D8KH0M4QSun, 08 Feb 2015 19:13:55 +0000The legitimacy of democratic systems requires the protection of minority preferences while ideally treating every voter equally. During the 2006 student elections at Columbia University, we asked voters to rank the importance of different contests and to choose where to cast a single extra "bonus vote," had one been available — a simple version of Storable Votes. We then constructed distributions of intensities and electoral outcomes and estimated the probable impact of the bonus vote through bootstrapping techniques. The bonus vote performs well: when minority preferences are particularly intense, the minority wins at least one contest with 15-30 percent probability; when the minority wins, aggregate welfare increases with 85-95 percent probability. The paper makes two contributions: it tests the performance of storable votes in a setting where preferences were not controlled, and it suggests the use of bootstrapping techniques when appropriate replications of the data cannot be obtained.Political scienceac186, ag389Economics, StatisticsArticlesSPAr package for Fan and Lo (2013) "A robust model-free approach for rare variants association studies incorporating gene-gene and gene-environmental interactions."
https://academiccommons.columbia.edu/catalog/ac:179424
Fan, Ruixue; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D84Q7SN6Fri, 07 Nov 2014 15:08:09 +0000A growing body of evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Secondly, it does not sacrifice the marginal detection power; in the situation when rare variants only have marginal effects it is comparable with the most competitive method in current literature. Thirdly, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can readily tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions.
This package is also maintained on the Comprehensive R Archive Network (http://cran.r-project.org). It contains the R programs, user's manual and example code.Genetics, Statisticsrf2283, shl5StatisticsComputer softwareSource codes for GLMLE algorithm
https://academiccommons.columbia.edu/catalog/ac:178966
Zheng, Tian; He, Ranhttp://dx.doi.org/10.7916/D8HH6HQRFri, 24 Oct 2014 16:03:53 +0000This is the R source code for the algorithm proposed for fitting exponential random graph models (ERGMs) on large social networks in our paper "Estimation of exponential random graph models for large social networks via graph limits". Specifically, the ERGM we implement is the one examined in that paper, which considers homomorphism densities of edges, two-stars and triangles.Statistics, Computer sciencetz33, rh2528StatisticsComputer softwareMathematical Modeling of Insider Trading
https://academiccommons.columbia.edu/catalog/ac:178871
Bilina Falafala, Roselinehttp://dx.doi.org/10.7916/D89W0D33Mon, 13 Oct 2014 12:41:52 +0000In this thesis, we study insider trading and consider a financial market and an enlarged financial market whose sets of information are respectively represented by the filtrations F and G. The filtration G is obtained by initially expanding the filtration F. We also consider that we have a finite trading horizon. First, we show that under certain conditions the enlarged market satisfies no free lunch with vanishing risk (NFLVR) locally and therefore satisfies no arbitrage with respect to admissible simple predictable trading strategies. In addition, we generalize the structure of all the G local martingale deflators and find sufficient conditions under which the enlarged market satisfies NFLVR. We apply our results to some recent examples of insider trading that have appeared in newspapers and by doing so, show the limitations of some previous works that have studied the stability of the NFLVR property under an initial expansion.
Second, assuming the enlarged market satisfies NFLVR and markets are incomplete, we define a notion of risk and compare the risk of a market or liquidity trader to the risk of an insider trader. We prove that the risk of an insider is smaller than the risk of a market/liquidity trader under some sufficient conditions that involve their respective trading strategies. We find a relationship between the trading strategies of a market trader and of an insider when the risk neutral measure of the market is used. If an insider trades using the market risk neutral measure and not her own, then her trading strategy should involve not only the stock but also the volatility of the stock.
Finally, assuming that the enlarged market satisfies NFLVR locally, we provide a way for an insider to price her financial claims. We also define a new type of process that we call a quasi-local martingale and prove that the stock price process under local NFLVR is one such process.Applied mathematics, FinanceStatisticsDissertationsCorrection: A dual clustering framework for association screening with whole genome sequencing data and longitudinal traits
https://academiccommons.columbia.edu/catalog/ac:200852
Liu, Ying; Huang, Chien-Hsun; Hu, Inchi; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8CV4G8ZTue, 23 Sep 2014 06:06:17 +0000Correction: For the previous publication of our article 1, Figure 1 was incorrectly processed as grayscale. We present, here in this correction, the original Figure in full color. Figure 1. Clustering of individuals using SNPs with MAFs between 0.01 and 0.05 for MAP4. A, Shown are 10 clusters, with the numbers at the top giving the odds ratios within each partition block based on blood pressures. Each row is a SNP, and each column is an individual. SNPs are ordered with decreasing MAFs (from top to bottom). Green vertical bars indicate subjects with higher blood pressures (see text). Genotype aa is plotted in red, aA is plotted in blue, and AA is plotted in white (a denotes the minor allele). The partitions of the 849 individuals are indicated by dotted lines. Most partition elements are driven by similarity on rarer SNPs but not on more common SNPs. B, Clustering of individuals using their SBP curves from the first simulation. It can be seen that individuals are reasonably grouped into 1 high blood pressure cluster and 1 low blood pressure cluster.Biostatistics, Genomicsyl2802, shl5, tz33Biostatistics, StatisticsArticlesNew insights into old methods for identifying causal rare variants
https://academiccommons.columbia.edu/catalog/ac:195277
Hu, Inchi; Zheng, Tian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Wang, Haitianhttp://dx.doi.org/10.7916/D8J38R1MTue, 09 Sep 2014 16:21:21 +0000The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.Biostatistics, Statistics, Statistics--Methodology, Human genetics--Variation, Biometry--Statistical methodstz33, shl5StatisticsArticlesBAMarray™: Java software for Bayesian analysis of variance for microarray data
https://academiccommons.columbia.edu/catalog/ac:192099
Ishwaran, Hemant; Rao, J. Sunil; Kogalur, Udaya B.http://dx.doi.org/10.7916/D8BR8QNZTue, 09 Sep 2014 00:33:07 +0000Background: DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important example is multigroup data collected over different experimental groups, such as data collected from distinct stages of a disease process. We have developed a method specifically addressing these issues termed Bayesian ANOVA for microarrays (BAM). The BAM approach uses a special inferential regularization known as spike-and-slab shrinkage that provides an optimal balance between total false detections and total false non-detections. This translates into more reproducible differential calls. Spike and slab shrinkage is a form of regularization achieved by using information across all genes and groups simultaneously.
Results: BAMarray™ is a graphically oriented Java-based software package that implements the BAM method for detecting differentially expressing genes in multigroup microarray experiments (up to 256 experimental groups can be analyzed). Drop-down menus allow the user to easily select between different models and to choose various run options. BAMarray™ can also be operated in a fully automated mode with preselected run options. Tuning parameters have been preset at theoretically optimal values freeing the user from such specifications. BAMarray™ provides estimates for gene differential effects and automatically estimates data adaptive, optimal cutoff values for classifying genes into biological patterns of differential activity across experimental groups. A graphical suite is a core feature of the product and includes diagnostic plots for assessing model assumptions and interactive plots that enable tracking of prespecified gene lists to study such things as biological pathway perturbations. The user can zoom in and lasso genes of interest that can then be saved for downstream analyses.
Conclusion: BAMarray™ is user-friendly, platform-independent software that effectively and efficiently implements the BAM methodology. Classifying patterns of differential activity is greatly facilitated by a data-adaptive cutoff rule and a graphical suite. BAMarray™ is licensed software freely available to academic institutions. More information can be found at http://www.bamarray.com.Statistics, Information technology, Bioinformatics, Bayesian statistical decision theory, DNA microarrays--Data processing, Java (Computer program language), Bioinformaticsubk2101StatisticsArticlesA note on QTL detecting for censored traits
https://academiccommons.columbia.edu/catalog/ac:192015
Fang, Yixinhttp://dx.doi.org/10.7916/D8N58JVHTue, 09 Sep 2014 00:32:05 +0000Most existing statistical methods for mapping quantitative trait loci (QTL) assume that the phenotype follows a normal distribution and that it is fully observed. However, some phenotypes have skewed distributions and may be censored. This note proposes a simple and efficient approach to QTL detection for censored traits using the Cox PH model, without estimating the baseline hazard function, which is treated as a nuisance.Genetics, Biostatistics, Genetics--Mathematical models, Censored observations (Statistics), Phenotypeyf2113StatisticsArticlesRheumatoid arthritis-associated gene-gene interaction network for rheumatoid arthritis candidate genes
https://academiccommons.columbia.edu/catalog/ac:184531
Huang, Chien-Hsun; Cong, Lei; Xie, Jun; Qiao, Bo; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8HX1B3VMon, 08 Sep 2014 23:31:36 +0000Rheumatoid arthritis (RA, MIM 180300) is a chronic and complex autoimmune disease. Using the North American Rheumatoid Arthritis Consortium (NARAC) data set provided in Genetic Analysis Workshop 16 (GAW16), we used the genotype-trait distortion (GTD) scores and proposed analysis procedures to capture the gene-gene interaction effects of multiple susceptibility gene regions on RA. In this paper, we focused on 27 RA candidate gene regions (531 SNPs) based on a literature search. Statistical significance was evaluated using 1000 permutations. HLADRB1 was found to have strong marginal association with RA. We identified 14 significant interactions (p < 0.01), which were aggregated into an association network among 12 selected candidate genes PADI4, FCGR3, TNFRSF1B, ITGAV, BTLA, SLC22A4, IL3, VEGF, TNF, NFKBIL1, TRAF1-C5, and MIF. Based on our and other contributors' findings during the GAW16 conference, we further studied 24 candidate regions with 336 SNPs. We found 23 significant interactions (p-value < 0.01), nine interactions in addition to our initial findings, and the association network was extended to include candidate genes HLA-A, HLA-B, HLA-C, CTLA4, and IL6. As we will discuss in this paper, the reported possible interactions between genes may suggest potential biological activities of RA.Biostatistics, Geneticsshl5, tz33StatisticsArticlesGenome-wide gene-based analysis of rheumatoid arthritis-associated interaction with PTPN22 and HLA-DRB1
https://academiccommons.columbia.edu/catalog/ac:184526
Qiao, Bo; Huang, Chien Hsun; Chong, Lei; Xie, Jun; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8NP22VMMon, 08 Sep 2014 23:31:27 +0000The genes PTPN22 and HLA-DRB1 have been found by a number of studies to confer an increased risk for rheumatoid arthritis (RA), which indicates that both genes play an important role in RA etiology. It is believed that they not only have strong association with RA individually, but also interact with other related genes that have not been found to have predisposing RA mutations. In this paper, we conduct genome-wide searches for RA-associated gene-gene interactions that involve PTPN22 or HLA-DRB1 using the Genetic Analysis Workshop 16 Problem 1 data from the North American Rheumatoid Arthritis Consortium. MGC13017, HSPCAL3, MIA, PTPNS1L, and IGLVI-70, which showed association with RA in previous studies, have been confirmed in our analysis.Genetics, Biostatisticsshi5, tz33StatisticsArticlesApplying Large-Scale Data and Modern Statistical Methods to Classical Problems in American Politics
https://academiccommons.columbia.edu/catalog/ac:177212
Ghitza, Yairhttp://dx.doi.org/10.7916/D8ZS2TT3Mon, 08 Sep 2014 21:08:28 +0000Exponential growth in data storage and computing capacity, alongside the development of new statistical methods, has facilitated powerful and flexible new research capabilities across a variety of disciplines. In each of these three essays, I use some new large-scale data source or advanced statistical method to address a well-known problem in the American Political Science literature. In the first essay, I build a generational model of presidential voting, in which long-term partisan presidential voting preferences are formed, in large part, through a weighted "running tally" of retrospective presidential evaluations, where weights are determined by the age at which the evaluation was made. By gathering hundreds of thousands of survey responses in combination with a new Bayesian model, I show that the political events of a voter's teenage and early adult years, centered around the age of 18, are enormously influential, particularly among white voters. In the second and third essays, I leverage a national voter registration database, which contains records for over 190 million registered voters, alongside methods like multilevel regression and poststratification (MRP) and coarsened exact matching (CEM) to address critical issues in public opinion research and in our understanding of the consequences of higher or lower turnout. In the process, I make numerous methodological and substantive contributions, including: building on the capabilities of MRP generally, describing methods for dealing with data of this size in the context of social science research, and characterizing mathematical limits of how turnout can impact election outcomes.Political scienceyg2173Political Science, StatisticsDissertationsLimit Theory for Spatial Processes, Bootstrap Quantile Variance Estimators, and Efficiency Measures for Markov Chain Monte Carlo
https://academiccommons.columbia.edu/catalog/ac:188852
Yang, Xuanhttp://dx.doi.org/10.7916/D84X560ZThu, 07 Aug 2014 12:12:31 +0000This thesis contains three topics: (I) limit theory for spatial processes, (II) asymptotic results on the bootstrap quantile variance estimator for importance sampling, and (III) an efficiency measure of MCMC.
(I) First, central limit theorems are obtained for sums of observations from a $\kappa$-weakly dependent random field. In particular, it is considered that the observations are made from a random field at irregularly spaced and possibly random locations. The sums of these samples as well as sums of functions of pairs of the observations are objects of interest; the latter has applications in covariance estimation, composite likelihood estimation, etc. Moreover, examples of $\kappa$-weakly dependent random fields are explored and a method for the evaluation of $\kappa$-coefficients is presented.
Next, statistical inference is considered for the stochastic heteroscedastic processes (SHP), which generalize the stochastic volatility time series model to space. A composite likelihood approach is adopted for parameter estimation, where the composite likelihood function is formed by a weighted sum of pairwise log-likelihood functions. In addition, the observation sites are assumed to be distributed according to a spatial point process. Sufficient conditions are provided for the maximum composite likelihood estimator to be consistent and asymptotically normal.
(II) It is often difficult to provide an accurate estimate of the variance of the weighted sample quantile. Its asymptotic approximation requires the value of the density function, which may be hard to evaluate in complex systems. To circumvent this problem, the bootstrap estimator is considered. Theoretical results are established for the exact convergence rate and asymptotic distributions of the bootstrap variance estimators for quantiles of weighted empirical distributions. Under regularity conditions, it is shown that the bootstrap variance estimator is asymptotically normal and has relative standard deviation of order $O(n^{-1/4})$.
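As a rough illustration of the kind of estimator discussed in (II), the following Python sketch computes a weighted sample quantile and a bootstrap estimate of its variance. It is not the thesis's code: the function names, the choice to resample (value, weight) pairs with replacement, and the default number of bootstrap replicates are all assumptions made for illustration.

```python
import random
from statistics import variance


def weighted_quantile(xs, ws, p):
    """Smallest x at which the weighted empirical CDF reaches p."""
    pairs = sorted(zip(xs, ws))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for x, w in pairs:
        cum += w
        if cum >= p * total:
            return x
    return pairs[-1][0]


def bootstrap_quantile_variance(xs, ws, p, n_boot=500, seed=0):
    """Bootstrap variance of the weighted p-quantile (hypothetical helper).

    Each replicate resamples (value, weight) pairs with replacement and
    recomputes the weighted quantile; the sample variance of the
    replicates is returned.
    """
    rng = random.Random(seed)
    n = len(xs)
    reps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(
            weighted_quantile([xs[i] for i in idx], [ws[i] for i in idx], p)
        )
    return variance(reps)
```

A call such as `bootstrap_quantile_variance(xs, ws, 0.95)` would take importance-sampling draws `xs` with nonnegative weights `ws`. No density evaluation is needed, which is the practical appeal of the bootstrap noted above.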
(III) A new performance measure is proposed to evaluate the efficiency of Markov chain Monte Carlo (MCMC) algorithms. More precisely, the large deviations rate of the probability that the Monte Carlo estimator deviates from the true value by a certain distance is used as a measure of efficiency of a particular MCMC algorithm. Numerical methods are proposed for the computation of the rate function based on samples of the renewal cycles of the Markov chain. Furthermore, the efficiency measure is applied to an array of MCMC schemes to determine their optimal tuning parameters.Statisticsxy2139StatisticsDissertationsStatistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
https://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephaniehttp://dx.doi.org/10.7916/D8TQ5ZP5Mon, 07 Jul 2014 11:50:59 +0000There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis examines several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models.
We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We begin by examining the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings.
Depending on the area of application and the researcher's degree of control over the experiment design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models. In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.Statistics, Educational tests and measurements, Quantitative psychology and psychometricsStatisticsDissertationsUnbiased Penetrance Estimates with Unknown Ascertainment Strategies
https://academiccommons.columbia.edu/catalog/ac:175879
Gore, Kristenhttp://dx.doi.org/10.7916/D8KP8098Mon, 07 Jul 2014 11:39:52 +0000Allelic variation in the genome leads to variation in individuals' production of proteins. This, in turn, leads to variation in traits and development, and, in some cases, to diseases. Understanding the genetic basis for disease can aid in the search for therapies and in guiding genetic counseling. Thus, it is of interest to discover the genes with mutations responsible for diseases and to understand the impact of allelic variation at those genes.
A subject's genetic composition is commonly referred to as the subject's genotype. Subjects who carry the gene mutation of interest are referred to as carriers. Subjects who are afflicted with a disease under study (that is, subjects who exhibit the phenotype) are termed affected carriers. The age-specific probability that a given subject will exhibit a phenotype of interest, given mutation status at a gene, is known as penetrance.
Understanding penetrance is an important facet of genetic epidemiology. Penetrance estimates are typically calculated via maximum likelihood from family data. However, penetrance estimates can be biased if the nature of the sampling strategy is not correctly reflected in the likelihood. Unfortunately, sampling of family data may be conducted in a haphazard fashion or, even if conducted systematically, might be reported in an incomplete fashion. Bias is possible in applying likelihood methods to reported data if (as is commonly the case) some unaffected family members are not represented in the reports.
The purpose here is to present an approach to find efficient and unbiased penetrance estimates in cases where there is incomplete knowledge of the sampling strategy and incomplete information on the full pedigree structure of families included in the data. The method may be applied with different conjectural assumptions about the ascertainment strategy to balance the possibly biasing effects of wishful assumptions about the sampling strategy with the efficiency gains that could be obtained through valid assumptions.StatisticsStatisticsDissertationsToward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals
https://academiccommons.columbia.edu/catalog/ac:174140
Stodden, Victoria C.; Guo, Peixuan; Ma, Zhaokunhttp://dx.doi.org/10.7916/D80K26NNWed, 21 May 2014 11:58:15 +0000Journal policy on research data and code availability is an important part of the ongoing shift toward publishing reproducible computational science. This article extends the literature by studying journal data sharing policies by year (for both 2011 and 2012) for a referent set of 170 journals. We make a further contribution by evaluating code sharing policies, supplemental materials policies, and open access status for these 170 journals for each of 2011 and 2012. We build a predictive model of open data and code policy adoption as a function of impact factor and publisher and find higher impact journals more likely to have open data and code policies and scientific societies more likely to have open data and code policies than commercial publishers. We also find open data policies tend to lead open code policies, and we find no relationship between open data and code policies and either supplemental material policies or open access journal status. Of the journals in this study, 38% had a data policy, 22% had a code policy, and 66% had a supplemental materials policy as of June 2012. This reflects a striking one year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies. We introduce a new dataset to the community that categorizes data and code sharing, supplemental materials, and open access policies in 2011 and 2012 for these 170 journals.Technical communication, Information sciencevcs2115, zm2168StatisticsArticlesBook Reviews: Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth.
https://academiccommons.columbia.edu/catalog/ac:173915
Madigan, David B.http://dx.doi.org/10.7916/D8DZ06D8Thu, 15 May 2014 12:45:12 +0000"Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth. MIT Press, Cambridge, MA, 2001. $50.00. xxxii+546 pp., hardcover. ISBN 0-262-08290-X. Is data mining the same as statistics? The distinguished authors of Principles of Data Mining struggle to make a distinction between the two subjects. In the end, what they have written is a fine applied statistics text." -- page 501Statisticsdm2418StatisticsReviewsMedication-Wide Association Studies
https://academiccommons.columbia.edu/catalog/ac:173912
Ryan, P. B.; Stang, P. E.; Madigan, David B.; Schuemie, M. J.; Hripcsak, George M.http://dx.doi.org/10.7916/D8PG1PVXThu, 15 May 2014 12:30:39 +0000Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges in confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects.Pharmacology, Statistics, Bioinformaticsdm2418, gh13Statistics, Biomedical InformaticsArticlesAlgorithms for Sparse Linear Classifiers in the Massive Data Setting
https://academiccommons.columbia.edu/catalog/ac:173908
Balakrishnan, Suhrid; Bartlett, Peter; Madigan, David B.http://dx.doi.org/10.7916/D8Z0368XThu, 15 May 2014 12:25:33 +0000Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.Statistics, Artificial intelligencedm2418StatisticsArticlesLearning Theory Analysis for Association Rules and Sequential Event Prediction
https://academiccommons.columbia.edu/catalog/ac:173905
Rudin, Cynthia; Letham, Benjamin; Madigan, David B.http://dx.doi.org/10.7916/D82N50C1Thu, 15 May 2014 12:19:33 +0000We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer's online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support.
The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.Statistics, Artificial intelligencedm2418StatisticsArticlesAnalysis of Variance of Cross-Validation Estimators of the Generalization Error
https://academiccommons.columbia.edu/catalog/ac:173902
Markatou, Marianthi; Tian, Hong; Biswas, Shameek; Hripcsak, George M.http://dx.doi.org/10.7916/D86D5R2XThu, 15 May 2014 11:58:33 +0000This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables $Y = \mathrm{Card}(S_j \cap S_{j'})$ and $Y^* = \mathrm{Card}(S_j^c \cap S_{j'}^c)$, where $S_j$, $S_{j'}$ are two training sets, and $S_j^c$, $S_{j'}^c$ are the corresponding test sets. We prove that the distribution of $Y$ and $Y^*$ is hypergeometric and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results to the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.Statistics, Artificial intelligencemm168, ht2031, spb2003, gh13Biostatistics, Biomedical Informatics, StatisticsArticlesA One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
https://academiccommons.columbia.edu/catalog/ac:173899
Balakrishnan, Suhrid; Madigan, David B.http://dx.doi.org/10.7916/D8B56GTPThu, 15 May 2014 11:51:51 +0000For Bayesian analysis of massive data, Markov chain Monte Carlo (MCMC) techniques often prove infeasible due to computational resource constraints. Standard MCMC methods generally require a complete scan of the dataset for each iteration. Ridgeway and Madigan (2002) and Chopin (2002b) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a small portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this paper, we present "1PFS," an efficient, one-pass algorithm. The algorithm employs a simple modification of the Ridgeway and Madigan (2002) particle filtering algorithm that replaces the MCMC based "rejuvenation" step with a more efficient "shrinkage" kernel smoothing based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway and Madigan (2002), namely a mixture model for Markov chains and Bayesian logistic regression. Our results indicate the proposed scheme delivers accurate parameter estimates while employing only a single pass through the data.Mathematics, Statisticsdm2418StatisticsArticlesA Characterization of Markov Equivalence Classes for Acyclic Digraphs
https://academiccommons.columbia.edu/catalog/ac:173896
Andersson, Steen A.; Madigan, David B.; Perlman, Michael D.http://dx.doi.org/10.7916/D8FX77J3Thu, 15 May 2014 11:28:36 +0000Undirected graphs and acyclic digraphs (ADG's), as well as their mutual extension to chain graphs, are widely used to describe dependencies among variables in multivariate distributions. In particular, the likelihood functions of ADG models admit convenient recursive factorizations that often allow explicit maximum likelihood estimates and that are well suited to building Bayesian networks for expert systems. Whereas the undirected graph associated with a dependence model is uniquely determined, there may be many ADG's that determine the same dependence (i.e., Markov) model. Thus, the family of all ADG's with a given set of vertices is naturally partitioned into Markov-equivalence classes, each class being associated with a unique statistical model. Statistical procedures, such as model selection or model averaging, that fail to take into account these equivalence classes may incur substantial computational or other inefficiencies. Here it is shown that each Markov-equivalence class is uniquely determined by a single chain graph, the essential graph, that is itself simultaneously Markov equivalent to all ADG's in the equivalence class. Essential graphs are characterized, a polynomial-time algorithm for their construction is given, and their applications to model selection and other statistical questions are described.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticlesCorrection: Separation and completeness properties for AMP chain graph Markov models
https://academiccommons.columbia.edu/catalog/ac:173887
Madigan, David B.; Levitz, Michael; Perlman, Michael D.http://dx.doi.org/10.7916/D8QF8R05Wed, 14 May 2014 19:42:28 +0000Correction of table 2 on page 1757 of 'Separation and completeness properties for AMP chain graph Markov models', Annals of Statistics, volume 29 (2001).Mathematics, Statisticsdm2418StatisticsArticlesBayesian Hierarchical Rule Modeling for Predicting Medical Conditions
https://academiccommons.columbia.edu/catalog/ac:173882
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D8V69GP1Wed, 14 May 2014 19:02:36 +0000We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future medical conditions given the patient’s current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “condition 1 and condition 2 → condition 3”) from a large set of candidate rules. Because this method “borrows strength” using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of conditions is available.Applied mathematics, Statistics, Medicinedm2418StatisticsArticles[Bayesian Analysis in Expert Systems]: Comment: What's Next?
https://academiccommons.columbia.edu/catalog/ac:173856
Madigan, David B.http://dx.doi.org/10.7916/D8W37TFJTue, 13 May 2014 17:59:40 +0000"These papers represent two of the many different graphical modeling camps that have emerged from a flurry of activity in the past decade. The paper by Cox and Wermuth falls within the statistical graphical modeling camp and provides a useful generalization of that body of work. There is, of course, a price to be paid for this generality, namely that the interpretation of the graphs is more complex...The paper by Spiegelhalter, Dawid, Lauritzen and Cowell falls within the probabilistic expert system camp. This is a tour de force by researchers responsible for much of the astonishing progress in this area. Ten years ago, probabilistic models were shunned by the artificial intelligence community. That they are now widely accepted and used is due in large measure to the insights and efforts of these authors, along with other pioneers such as Judea Pearl and Peter Cheeseman..." -- page 261Mathematics, Statisticsdm2418StatisticsArticlesBayesian Model Averaging: a Tutorial (with Comments by M. Clyde, David Draper and E. I. George, and a Rejoinder by the Authors)
https://academiccommons.columbia.edu/catalog/ac:173853
Hoeting, Jennifer A.; Madigan, David B.; Raftery, Adrian E.; Volinsky, Chris T.; Clyde, M.; Draper, David; George, E. I.http://dx.doi.org/10.7916/D84M92N7Tue, 13 May 2014 17:39:49 +0000Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.Statisticsdm2418StatisticsArticles[A Report on the Future of Statistics]: Comment
https://academiccommons.columbia.edu/catalog/ac:173850
Madigan, David B.; Stuetzle, Wernerhttp://dx.doi.org/10.7916/D8D50K3VTue, 13 May 2014 17:28:46 +0000"Extraordinary opportunities for statistical ideas and for statisticians now present themselves. However, to take advantage of the opportunities, statistics has to change the way in which it recruits and trains students. Statistics has primarily focused on squeezing the maximum amount of information out of limited data. This paradigm is rapidly diminishing in importance and statistics education finds itself out of step with reality. The problems begin at the high school and undergraduate levels, where the standard course includes a narrow set of pre-computing-era topics. At the graduate level, the typical statistics program suffers from the same problem..." -- page 408Mathematics education, Higher educationdm2418StatisticsArticlesSeparation and Completeness Properties for AMP Chain Graph Markov Models
https://academiccommons.columbia.edu/catalog/ac:173847
Levitz, Michael; Perlman, Michael D.; Madigan, David B.http://dx.doi.org/10.7916/D8X34VJGTue, 13 May 2014 16:30:46 +0000Pearl’s well-known d-separation criterion for an acyclic directed graph (ADG) is a pathwise separation criterion that can be used to efficiently identify all valid conditional independence relations in the Markov model determined by the graph. This paper introduces p-separation, a pathwise separation criterion that efficiently identifies all valid conditional independences under the Andersson–Madigan–Perlman (AMP) alternative Markov property for chain graphs (= adicyclic graphs), which include both ADGs and undirected graphs as special cases. The equivalence of p-separation to the augmentation criterion occurring in the AMP global Markov property is established, and p-separation is applied to prove completeness of the global Markov property for AMP chain graph models. Strong completeness of the AMP Markov property is established, that is, the existence of Markov perfect distributions that satisfy those and only those conditional independences implied by the AMP property (equivalently, by p-separation). A linear-time algorithm for determining p-separation is presented.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticles[Least Angle Regression]: Discussion
https://academiccommons.columbia.edu/catalog/ac:173841
Madigan, David B.; Ridgeway, Greghttp://dx.doi.org/10.7916/D81V5C29Tue, 13 May 2014 16:15:23 +0000Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms [e.g., Osborne, Presnell and Turlach (2000)]. Efron, Hastie, Johnstone and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such this paper is an important contribution to statistical computing.Mathematics, Statisticsdm2418StatisticsArticlesA Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction
https://academiccommons.columbia.edu/catalog/ac:173838
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D89C6VJDTue, 13 May 2014 15:27:01 +0000In many healthcare settings, patients visit healthcare professionals periodically and report multiple medical conditions, or symptoms, at each encounter. We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future symptoms given the patient’s current and past history of reported symptoms. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “symptom 1 and symptom 2 → symptom 3”) from a large set of candidate rules. Because this method “borrows strength” using the symptoms of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of symptoms is available.Mathematics, Statistics, Medicinedm2418StatisticsArticlesGenerating Productive Dialogue between Consulting Statisticians and their Clients in the Pharmaceutical and Medical Research Settings
https://academiccommons.columbia.edu/catalog/ac:173832
Emir, Birol; Amaratunga, Dhammika; Beltangady, Mohan; Cabrera, Javier; Freeman, Roy; Madigan, David B.; Nguyen, Ha H.; Whalen, Edward Patrickhttp://dx.doi.org/10.7916/D8PK0D8NTue, 13 May 2014 15:09:40 +0000Due to the ever-increasing complexity of scientific technologies and resulting data, consulting statisticians are becoming more involved in the design, conduct, and analysis of biomedical research. This requires extensive collaboration between the consulting statistician and nonstatisticians, such as researchers, clinicians, and corporate executives. Consequently, a successful consulting career is becoming ever more dependent on the statistician's ability to effectively communicate with nonstatisticians. This is especially true when more complex, nontraditional analytical methods are required. In this paper, we examine the collaboration between statisticians and nonstatisticians from three different professional perspectives. Integrating these perspectives, we discuss ways to help the consulting statistician generate productive dialogue with clients. Finally, we examine how universities can better prepare students for careers in statistical consulting by incorporating more communication-based elements into their curriculum and by offering students ample opportunities to collaborate with nonstatisticians. Overall, we designed this exercise to help the consulting statistician generate dialogue with clients that results in more productive collaborations and a more satisfying work experience.Statistics, Bioinformatics, Medicinebe2166, dm2418, hhn2108, ew2320StatisticsArticlesA Note on Equivalence Classes of Directed Acyclic Independence Graphs
https://academiccommons.columbia.edu/catalog/ac:173826
Madigan, David B.http://dx.doi.org/10.7916/D8TB150CTue, 13 May 2014 14:46:04 +0000Directed acyclic independence graphs (DAIGs) play an important role in recent developments in probabilistic expert systems and influence diagrams (Chyu [1]). The purpose of this note is to show that DAIGs can usefully be grouped into equivalence classes where the members of a single class share identical Markov properties. These equivalence classes can be identified via a simple graphical criterion. This result is particularly relevant to model selection procedures for DAIGs (see, e.g., Cooper and Herskovits [2] and Madigan and Raftery [4]) because it reduces the problem of searching among possible orientations of a given graph to that of searching among the equivalence classes.Mathematics, Statisticsdm2418StatisticsArticlesLocation Estimation in Wireless Networks: A Bayesian Approach
https://academiccommons.columbia.edu/catalog/ac:173820
Madigan, David B.; Ju, Wen-Hua; Krishnan, P.; Krishnakumar, A. S.; Zorych, Ivanhttp://dx.doi.org/10.7916/D82V2D74Tue, 13 May 2014 14:25:34 +0000We present a Bayesian hierarchical model for indoor location estimation in wireless networks. We demonstrate that our model achieves accuracy that is similar to other published models and algorithms. By harnessing prior knowledge, our model drastically reduces the requirement for training data as compared with existing approaches.Mathematics, Statistics, Applied mathematicsdm2418StatisticsArticles