Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Blanguage%5D%5B%5D=English&f%5Bsubject_facet%5D%5B%5D=Statistics&q=&rows=500&sort=record_creation_date+desc
Conditional Exceedance Probabilities
http://academiccommons.columbia.edu/catalog/ac:196847
Mason, Simon J.; Galpin, Jacqueline S.; Goddard, Lisa M.; Graham, Nicholas E.; Rajaratnam, Balakanapathy
http://dx.doi.org/10.7916/D8PK0G2S
Thu, 07 Apr 2016 00:00:00 +0000
Probabilistic forecasts of variables measured on a categorical or ordinal scale, such as precipitation occurrence or temperatures exceeding a threshold, are typically verified by comparing the relative frequency with which the target event occurs given different levels of forecast confidence. The degree to which this conditional (on the forecast probability) relative frequency of an event corresponds with the actual forecast probabilities is known as reliability, or calibration. Forecast reliability for binary variables can be measured using the Murphy decomposition of the (half) Brier score, and can be presented graphically using reliability and attributes diagrams. For forecasts of variables on continuous scales, however, an alternative measure of reliability is required. The binned probability histogram and the reliability component of the continuous ranked probability score have been proposed as appropriate verification procedures in this context, but are subject to some limitations. A procedure is proposed that is applicable in the context of forecast ensembles and is an extension of the binned probability histogram. Individual ensemble members are treated as estimates of quantiles of the forecast distribution, and the conditional probability that the observed precipitation, for example, exceeds the amount forecast [the conditional exceedance probability (CEP)] is calculated. Generalized linear regression is used to estimate these conditional probabilities. A diagram showing the CEPs for ranked ensemble members is suggested as a useful method for indicating reliability when forecasts are on a continuous scale, and various statistical tests are suggested for quantifying the reliability.
Atmospheric sciences, Climatic changes--Forecasting, Climatic changes--Mathematical models, Statistics
sjm2103, lmg107
International Research Institute for Climate and Society
Articles

Assessing the predictability of extreme rainfall seasons over southern Africa
http://academiccommons.columbia.edu/catalog/ac:196917
Landman, Willem A.; Botes, Stephanie; Goddard, Lisa M.; Shongwe, Mxolisi
http://dx.doi.org/10.7916/D8B56JPQ
Thu, 07 Apr 2016 00:00:00 +0000
A model output statistics (MOS) technique is developed to investigate the potential rainfall forecast skill for extreme seasons over southern Africa. Rainfall patterns produced by the ECHAM4.5 atmospheric GCM are statistically recalibrated to regional rainfall for the seasons of September–November, December–February, March–May and June–August. Archived records of the GCM simulated fields are related to observed rainfall through a set of canonical correlation analysis (CCA) equations. Probabilistic forecast skill (RPSS and ROC) of MOS-recalibrated simulations for 5 equi-probable categories is assessed using a 3-year-out cross-validation approach. High skill RPSS values are found for the DJF and MAM seasons. Although ROC scores for DJF and MAM are larger than 0.5 for all categories (scores less than 0.5 suggest negative skill), scores for DJF show that the extreme categories are more predictable than the inner categories and scores for MAM show that skill is mostly associated with the extremely wet category. The GCM's ability to reproduce tropical-temperate trough variability constitutes the main source of predictability for DJF and MAM.
Atmospheric sciences, Precipitation forecasting, Africa, Southern, Atmospheric circulation, Statistics
wal2113, lmg107
International Research Institute for Climate and Society
Articles

Reply
http://academiccommons.columbia.edu/catalog/ac:196838
Mason, Simon J.; Tippett, Michael K.; Weigel, Andreas P.; Goddard, Lisa M.; Rajaratnam, Balakanapathy
http://dx.doi.org/10.7916/D8Z31ZKB
Thu, 07 Apr 2016 00:00:00 +0000
Reply to a comment on the article: Conditional Exceedance Probabilities. Monthly Weather Review 135 (2010), 363–372 (available in Academic Commons at http://dx.doi.org/10.7916/D8PK0G2S).
Atmospheric sciences, Climatic changes--Forecasting, Climatic changes--Mathematical models, Statistics
sjm2103, mkt14, lmg107
Applied Physics and Applied Mathematics, International Research Institute for Climate and Society
Articles

Statistical–Dynamical Seasonal Forecasts of Central-Southwest Asian Winter Precipitation
http://academiccommons.columbia.edu/catalog/ac:196890
Tippett, Michael K.; Goddard, Lisa M.; Barnston, Anthony G.
http://dx.doi.org/10.7916/D8NK3DZF
Thu, 07 Apr 2016 00:00:00 +0000
Interannual precipitation variability in central-southwest (CSW) Asia has been associated with East Asian jet stream variability and western Pacific tropical convection. However, atmospheric general circulation models (AGCMs) forced by observed sea surface temperature (SST) poorly simulate the region’s interannual precipitation variability. The statistical–dynamical approach uses statistical methods to correct systematic deficiencies in the response of AGCMs to SST forcing. Statistical correction methods linking model-simulated Indo–west Pacific precipitation and observed CSW Asia precipitation result in modest, but statistically significant, cross-validated simulation skill in the northeast part of the domain for the period from 1951 to 1998. The statistical–dynamical method is also applied to recent (winter 1998/99 to 2002/03) multimodel, two-tier December–March precipitation forecasts initiated in October. This period includes 4 yr (winter of 1998/99 to 2001/02) of severe drought. Tercile probability forecasts are produced using ensemble-mean forecasts and forecast error estimates. The statistical–dynamical forecasts show enhanced probability of below-normal precipitation for the four drought years and capture the return to normal conditions in part of the region during the winter of 2002/03.
Atmospheric sciences, Climatic changes--Forecasting, Precipitation forecasting, Middle East, Asia, Central, Statistics
mkt14, lmg107, agb52
Applied Physics and Applied Mathematics, International Research Institute for Climate and Society
Articles

Predicting southern African summer rainfall using a combination of MOS and perfect prognosis
http://academiccommons.columbia.edu/catalog/ac:196920
Landman, Willem A.; Goddard, Lisa M.
http://dx.doi.org/10.7916/D8959HHW
Thu, 07 Apr 2016 00:00:00 +0000
A statistical-dynamical approach to probabilistic precipitation forecasts of southern African summer rainfall is described and validated. An ensemble of seasonal precipitation and circulation fields is obtained from the ECHAM4.5 atmospheric general circulation model (AGCM). Model output statistics (MOS) then spatially recalibrate the AGCM fields relative to observations. Although the MOS equations are built using the simulation data, in which observed SSTs force the AGCM, the same set of equations can be applied to the predicted data, in which predicted SSTs force the AGCM. The use of prediction data in a set of equations developed for simulations assumes that the AGCM forecast skill approximates its simulation skill and that the systematic biases of the AGCM do not change in a prediction setting; this assumption is analogous to a perfect prognosis (PP) approach. Probabilistic forecast skill is assessed using this MOS-PP-recalibration scheme for 3 equi-probable categories using a 3-year-out cross-validation approach. High skill scores are found over the north-eastern interior of the region, with marginal skill over the remainder of the austral summer rainfall regions. When skill is assessed for only the wettest and driest of the years, high skill appears over most of the region.
Atmospheric sciences, Precipitation forecasting, Africa, Southern, Atmospheric circulation, Statistics
wal2113, lmg107
International Research Institute for Climate and Society
Articles

New perspectives on learning, inference, and control in brains and machines
http://academiccommons.columbia.edu/catalog/ac:196425
Merel, Joshua Scott
http://dx.doi.org/10.7916/D8C8296C
Wed, 16 Mar 2016 18:35:32 +0000
The work presented in this thesis provides new perspectives and approaches for problems that arise in the analysis of neural data. Particular emphasis is placed on parameter fitting and automated analysis problems that would arise naturally in closed-loop experiments. Part one focuses on two brain-computer interface problems. First, we provide a framework for understanding co-adaptation, the setting in which decoder updating and user learning occur simultaneously. We also provide a new perspective on intention-based parameter fitting and tools to extend this approach to higher dimensional decoders. Part two focuses on event inference, which refers to the decomposition of observed timeseries data into interpretable events. We present application of event inference methods on voltage-clamp recordings as well as calcium imaging, and describe extensions to allow for combining data across modalities or trials.
Neurosciences, Statistics
jsm2183
Neurobiology and Behavior, Statistics
Dissertations

Prior Design for Dependent Dirichlet Processes: An Application to Marathon Modeling
http://academiccommons.columbia.edu/catalog/ac:195557
Pradier, Melanie F.; Ruiz, Francisco Jesus Rodriguez; Perez-Cruz, Fernando
http://dx.doi.org/10.7916/D8SN08V7
Mon, 14 Mar 2016 00:00:00 +0000
This paper presents a novel application of Bayesian nonparametrics (BNP) for marathon data modeling. We make use of two well-known BNP priors, the single-p dependent Dirichlet process and the hierarchical Dirichlet process, in order to address two different problems. First, we study the impact of age, gender and environment on the runners’ performance. We derive a fair grading method that allows direct comparison of runners regardless of their age and gender. Unlike current grading systems, our approach is based not only on top world records, but on the performances of all runners. The presented methodology for comparison of densities can be adopted in many other applications straightforwardly, providing an interesting perspective to build dependent Dirichlet processes. Second, we analyze the running patterns of the marathoners in time, obtaining information that can be valuable for training purposes. We also show that these running patterns can be used to predict finishing time given intermediate interval measurements. We apply our models to New York City, Boston and London marathons.
Statistics, Information science, Stochastic processes, Marathon running, Running races--Data processing, Nonparametric statistics
fr2392
Data Science Institute
Articles

Dynamics of Large Rank-Based Systems of Interacting Diffusions
http://academiccommons.columbia.edu/catalog/ac:195668
Bruggeman, Cameron
http://dx.doi.org/10.7916/D80G3K1G
Thu, 10 Mar 2016 00:00:00 +0000
We study systems of n dimensional diffusions whose drift and dispersion coefficients depend only on the relative ranking of the processes. We consider the question of how long it takes for a particle to go from one rank to another. It is argued that as n gets large, the distribution of particles satisfies a Porous Medium Equation. Using this, we derive a deterministic limit for the system of particles. This limit allows for direct calculation of the properties of the rank traversal time. The results are extended to the case of asymmetrically colliding particles. These models are of interest in the study of financial markets and economic inequality. In particular, we derive limits for the performance of some Functionally Generated Portfolios originating from Stochastic Portfolio Theory.
Mathematics, Statistics, Diffusion processes, Diffusion--Mathematical models, Dispersion--Mathematical models, Porous materials--Mathematical models, Portfolio management--Mathematical models
cpb2133
Mathematics
Dissertations

Semi-convergence of an Iterative Algorithm
http://academiccommons.columbia.edu/catalog/ac:194857
Vasilaky, Kathryn N.
http://dx.doi.org/10.7916/D8SJ1KFX
Fri, 26 Feb 2016 00:00:00 +0000
An iterative method is introduced for solving noisy, ill-conditioned inverse problems. Analysis of the semi-convergence behavior identifies three error components: iteration error, noise error, and initial guess error. A derived expression explains how the three errors are related to each other relative to the number of iterations. The standard Tikhonov regularization method is just the first iteration of the iterative method, and the derived noise damping filter is a generalization of the standard Tikhonov filter. The derived filter is a function of two parameters, a regularization parameter and the iteration number. The new method is tested on a simulated data set for image reconstruction from projections.
Statistics, Mathematics, Inverse problems (Differential equations), Iterative methods (Mathematics), Filters (Mathematics)
knv4
Earth Institute
Reports

Methods for Personalized and Evidence Based Medicine
http://academiccommons.columbia.edu/catalog/ac:195007
Shahn, Zach
http://dx.doi.org/10.7916/D8M0458S
Wed, 24 Feb 2016 00:00:00 +0000
There is broad agreement that medicine ought to be 'evidence based' and 'personalized' and that data should play a large role in achieving both these goals. But the path from data to improved medical decision making is not clear. This thesis presents three methods that hopefully help in small ways to clear the path. Personalized medicine depends almost entirely on understanding variation in treatment effect. Chapter 1 describes latent class mixture models for treatment effect heterogeneity that distinguish between continuous and discrete heterogeneity, use hierarchical shrinkage priors to mitigate overfitting and multiple comparisons concerns, and employ flexible error distributions to improve robustness. We apply different versions of these models to reanalyze a clinical trial comparing HIV treatments and a natural experiment on the effect of Medicaid on emergency department utilization. Medical decisions often depend on observational studies performed on large longitudinal health insurance claims databases. These studies usually claim to identify a causal effect, but empirical evaluations have demonstrated that standard methods for causal discovery perform poorly in this context, most likely in large part due to the presence of unobserved confounding. Chapter 2 proposes an algorithm called Ensembles of Granger Graphs (EGG) that does not rely on the assumption that unobserved confounding is absent. In a simulation and experiments on a real claims database, EGG is robust to confounding, has high positive predictive value, and has high power to detect strong causal effects. While decision making inherently involves causal inference, purely predictive models aid many medical decisions in practice. Predictions from health histories are challenging because the space of possible predictors is so vast. Not only are there thousands of health events to consider, but also their temporal interactions. In Chapter 3, we adapt a method originally developed for speech recognition that greedily constructs informative labeled graphs representing temporal relations between multiple health events at the nodes of randomized decision trees. We use this method to predict strokes in patients with atrial fibrillation using data from a Medicaid claims database. I hope the ideas illustrated in these three projects inspire work that someday genuinely improves healthcare. I also include a short 'bonus' chapter on an improved estimate of effective sample size in importance sampling. This chapter is not directly related to medicine, but finds a home in this thesis nonetheless.
Statistics, Medical care--Statistics, Evidence-based medicine, Personalized medicine
zss2101
Statistics
Dissertations

Statistics of surface divergence and their relation to air-water gas transfer velocity
http://academiccommons.columbia.edu/catalog/ac:194442
Asher, William E.; Liang, Hanzhuang; Zappa, Christopher J.; Loewen, Mark R.; Mukto, Moniz A.; Litchendorf, Trina M.; Jessup, Andrew T.
http://dx.doi.org/10.7916/D8571BVQ
Mon, 22 Feb 2016 00:00:00 +0000
Air-sea gas fluxes are generally defined in terms of the air/water concentration difference of the gas and the gas transfer velocity, kL. Because it is difficult to measure kL in the ocean, it is often parameterized using more easily measured physical properties. Surface divergence theory suggests that infrared (IR) images of the water surface, which contain information concerning the movement of water very near the air-water interface, might be used to estimate kL. Therefore, a series of experiments testing whether IR imagery could provide a convenient means for estimating the surface divergence applicable to air-sea exchange were conducted in a synthetic jet array tank embedded in a wind tunnel. Gas transfer velocities were measured as a function of wind stress and mechanically generated turbulence; laser-induced fluorescence was used to measure the concentration of carbon dioxide in the top 300 μm of the water surface; IR imagery was used to measure the spatial and temporal distribution of the aqueous skin temperature; and particle image velocimetry (PIV) was used to measure turbulence at a depth of 1 cm below the air-water interface. It is shown that an estimate of the surface divergence for both wind-shear driven turbulence and mechanically generated turbulence can be derived from the surface skin temperature. The estimates derived from the IR images are compared to velocity field divergences measured by the PIV and to independent estimates of the divergence made using the laser-induced fluorescence data. Divergence is shown to scale with kL values measured using gaseous tracers as predicted by conceptual models for both wind-driven and mechanically generated turbulence.
Physical oceanography, Mathematics, Statistics, Surface waves (Oceanography), Ocean-atmosphere interaction, Divergence theorem, Gas flow--Mathematical models
cjz9
Lamont-Doherty Earth Observatory
Articles

Are We Ready for Mass Fatality Incidents? Preparedness of the US Mass Fatality Infrastructure
http://academiccommons.columbia.edu/catalog/ac:192811
Merrill, Jacqueline A.; Orr, Mark; Chen, Daniel; Zhi, Qi; Gershon, Robyn
http://dx.doi.org/10.7916/D8125SF8
Fri, 08 Jan 2016 00:00:00 +0000
Objective: To assess the preparedness of the US mass fatality infrastructure, we developed and tested metrics for 3 components of preparedness: organizational, operational, and resource sharing networks. Methods: In 2014, data were collected from 5 response sectors: medical examiners and coroners, the death care industry, health departments, faith-based organizations, and offices of emergency management. Scores were calculated within and across sectors and a weighted score was developed for the infrastructure. Results: A total of 879 respondents reported highly variable organizational capabilities: 15% had responded to a mass fatality incident (MFI); 42% reported staff trained for an MFI, but only 27% for an MFI involving hazardous contaminants. Respondents estimated that 75% of their staff would be willing and able to respond, but only 53% if contaminants were involved. Most perceived their organization as somewhat prepared, but 13% indicated “not at all.” Operational capability scores ranged from 33% (death care industry) to 77% (offices of emergency management). Network capability analysis found that only 42% of possible reciprocal relationships between resource-sharing partners were present. The cross-sector composite score was 51%; that is, half the key capabilities for preparedness were in place. Conclusions: The sectors in the US mass fatality infrastructure report suboptimal capability to respond. National leadership is needed to ensure sector-specific and infrastructure-wide preparedness for a large-scale MFI.
Health sciences, Health care management, Statistics, Disaster medicine, Mass casualties, Medical care, Emergency management
jam119
Nursing
Articles

Distributed Bayesian Computation and Self-Organized Learning in Sheets of Spiking Neurons with Local Lateral Inhibition
http://academiccommons.columbia.edu/catalog/ac:192253
Bill, Johannes; Buesing, Lars; Habenschuss, Stefan; Nessler, Bernhard; Maass, Wolfgang; Legenstein, Robert
http://dx.doi.org/10.7916/D8862G4X
Mon, 14 Dec 2015 00:00:00 +0000
During the last decade, Bayesian probability theory has emerged as a framework in cognitive science and neuroscience for describing perception, reasoning and learning of mammals. However, our understanding of how probabilistic computations could be organized in the brain, and how the observed connectivity structure of cortical microcircuits supports these calculations, is rudimentary at best. In this study, we investigate statistical inference and self-organized learning in a spatially extended spiking network model that accommodates both local competitive and large-scale associative aspects of neural information processing, under a unified Bayesian account. Specifically, we show how the spiking dynamics of a recurrent network with lateral excitation and local inhibition, in response to distributed spiking input, can be understood as sampling from a variational posterior distribution of a well-defined implicit probabilistic model. This interpretation further permits a rigorous analytical treatment of experience-dependent plasticity on the network level. Using machine learning theory, we derive update rules for neuron and synapse parameters which equate with Hebbian synaptic and homeostatic intrinsic plasticity rules in a neural implementation. In computer simulations, we demonstrate that the interplay of these plasticity rules leads to the emergence of probabilistic local experts that form distributed assemblies of similarly tuned cells communicating through lateral excitatory connections. The resulting sparse distributed spike code of a well-adapted network carries compressed information on salient input features combined with prior experience on correlations among them. Our theory predicts that the emergence of such efficient representations benefits from network architectures in which the range of local inhibition matches the spatial extent of pyramidal cells that share common afferent input.
Neurosciences, Molecular biology, Statistics, Bayesian statistical decision theory, Neurons, Neuroplasticity, Inhibition
Statistics
Articles

The WTO Dispute Settlement System: 1995-2010 Some Descriptive Statistics
http://academiccommons.columbia.edu/catalog/ac:192343
Mavroidis, Petros C.; Horn, Henrik; Johannesson, Louise
http://dx.doi.org/10.7916/D8B27TZZ
Fri, 11 Dec 2015 00:00:00 +0000
This paper reports descriptive statistics based on the WTO Dispute Settlement Data Set (Ver. 3.0). The data set contains approximately 67 000 observations on a wide range of aspects of the Dispute Settlement (DS) system, and is exclusively based on official WTO documents. It covers all 426 WTO disputes initiated through the official filing of a Request for Consultations from January 1, 1995, until August 11, 2011, and for these disputes it includes events occurring until July 28, 2011. In this paper, however, we will omit data pertaining to 2011 and only consider the full years 1995–2010. In order to shed some light on differences across WTO Members in participation in the DS system, we will divide Members into five groups, as specified in detail in Table 1. Broadly speaking, these groups are: G2 - The European Union (EU), and the United States (US); IND - Other industrialized countries; DEV - Developing countries other than LDC; LDC - Least developed countries; BIC - Brazil, India and China. The EU is taken to be EU-15, since the enlargements came relatively late during the period we cover. For the most part, the choice in this regard makes little difference quantitatively, since most of the 12 countries acceding to the EU in 2004 and 2007 have been relatively inactive in the WTO. The LDC group corresponds to the list of LDCs prepared by the United Nations. A more discretionary line is drawn between IND and DEV. We have classified under IND, OECD Members, the non-OECD Members among the 12 countries that most recently became members of the EU, those that are currently at an advanced stage of their accession negotiations, as well as countries that are not OECD Members but have a very high per capita income, such as Singapore. The DEV group consists of all countries which do not fit into either of the above mentioned categories, and are not BIC countries either. BIC refers to Brazil, India, and China: the sheer number of cases in which Brazil, India and China have participated, as well as their overall participation in WTO, led us to treat these three countries as a separate group. The paper is structured as follows: Section 2 highlights the evolution of the total use of the DS system; Section 3 discusses some aspects of participation of the groups defined above when acting as complainants or respondents; Section 4 deals with the subject-matter of disputes; Section 5 highlights a few aspects of countries’ success with regard to the legal claims they made before panels; Section 6 provides information as to the nationality and the appointment process of WTO panelists; Section 7 focuses on the duration of dispute settlement procedures at different stages of the adjudication process; Section 8 concludes.
Law, International law, World Trade Organization, Dispute resolution (Law), Statistics, European Union, United States, Developed countries, Developing countries, Brazil, China, India
pm2030
Law
Articles

An Assortment of Unsupervised and Supervised Applications to Large Data
http://academiccommons.columbia.edu/catalog/ac:189937
Agne, Michael Robert
http://dx.doi.org/10.7916/D828073N
Thu, 15 Oct 2015 00:00:00 +0000
This dissertation presents several methods that can be applied to large datasets with an enormous number of covariates. It is divided into two parts. In the first part of the dissertation, a novel approach to pinpointing sets of related variables is introduced. In the second part, several new methods and modifications of current methods designed to improve prediction are outlined. These methods can be considered extensions of the very successful I Score suggested by Lo and Zheng in a 2002 paper and refined in many papers since. In Part I, unsupervised data (with no response) is addressed. In chapter 2, the novel unsupervised I score and its associated procedure are introduced and some of its unique theoretical properties are explored. In chapter 3, several simulations consisting of generally hard-to-wrangle scenarios demonstrate promising behavior of the approach. The method is applied to the complex field of market basket analysis, with a specific grocery data set used to show it in action in chapter 4. It is compared to a natural competitor, the A Priori algorithm. The main contribution of this part of the dissertation is the unsupervised I score, but we also suggest several ways to leverage the variable sets the I score locates in order to mine for association rules. In Part II, supervised data is confronted. Though the I Score has been used in reference to these types of data in the past, several interesting ways of leveraging it (and the modules of covariates it identifies) are investigated. Though much of this methodology adopts procedures which are individually well-established in literature, the contribution of this dissertation is organization and implementation of these methods in the context of the I Score. Several module-based regression and voting methods are introduced in chapter 7, including a new LASSO-based method for optimizing voting weights. These methods can be considered intuitive and readily applicable to a huge number of datasets of sometimes colossal size. In particular, in chapter 8, a large dataset on Hepatitis and another on Oral Cancer are analyzed. The results for some of the methods are quite promising and competitive with existing methods, especially with regard to prediction. A flexible and multifaceted procedure is suggested in order to provide a thorough arsenal when dealing with the problem of prediction in these complex data sets. Ultimately, we highlight some benefits and future directions of the method.
Statistics, Biostatistics
mra2110
Statistics
Dissertations

Development of a Parsimonious Set of City-level Environmental Performance Metrics for Jiyuan, Henan, China
http://academiccommons.columbia.edu/catalog/ac:188777
Guo, Dong; Bose, Satyajit
http://dx.doi.org/10.7916/D8VQ322W
Fri, 25 Sep 2015 00:00:00 +0000
The potential tradeoff between the twin goals of reducing environmental impact while maintaining growth will require China’s cities to evaluate the economic impact of urban pollution at the local level. Using economic input-output analysis, city level indicators of economic activity and environmental impact and available estimates of the benchmark relationships between output and pollution by sector, we outline a method to quantify in monetary terms the marginal damages of air pollution by sector at the city level. By applying the framework of environmental accounting to the pilot case of Jiyuan, a small city in Henan province, we demonstrate a method for local public agencies to facilitate administrative tracking of monetized air pollution based on underlying economic activity, and outline a minimum set of metrics which a small city in China must track in order to estimate the monetized damage of air pollution by sector. Our methodology leverages economy-wide aggregate models (Ho and Nielsen 2007, The World Bank 2007) to significantly reduce the metrics required for a simple approximation of the relative value added per unit of emission by sector for medium-sized cities in China.
Environmental economics, Public policy, Statistics
dg2350, sgb2
School of Continuing Education, Earth Institute
Reports

Higher-order Properties of Approximate Estimators
http://academiccommons.columbia.edu/catalog/ac:188409
Kristensen, Dennis; Salanie, Bernard
http://dx.doi.org/10.7916/D89886BK
Fri, 18 Sep 2015 00:00:00 +0000
Many modern estimation methods in econometrics approximate an objective function, for instance, through simulation or discretization. These approximations typically affect both bias and variance of the resulting estimator. We first provide a higher-order expansion of such "approximate" estimators that takes into account the errors due to the use of approximations. We show how a Newton-Raphson adjustment can reduce the impact of approximations. Then we use our expansions to develop inferential tools that take into account approximation errors: we propose adjustments of the approximate estimator that remove its first-order bias and adjust its standard errors. These corrections apply to a class of approximate estimators that includes all known simulation-based procedures. A Monte Carlo simulation on the mixed logit model shows that our proposed adjustments can yield spectacular improvements at a low computational cost.
Statistics, Economics, Mathematics, Computer science
bs2237
Economics
Working papers

How New Yorkers Prefer to Take Public Transport? A Comprehensive Analysis Based on 2010-2011 Regional Household Travel Survey
http://academiccommons.columbia.edu/catalog/ac:187307
Tong, Yinan
http://dx.doi.org/10.7916/D8ZG6RFR
Fri, 17 Jul 2015 12:04:56 +0000
Public transport as a means of transport is an essential part of moving travelers from place to place. Considering the aggregate mode of travel, public transport is regarded as a more environmentally friendly and sustainable travel mode compared to single-occupancy vehicle travel. I am interested in discovering how the built environment, individual characteristics, and travel characteristics change mode choice preference in the New York Metropolitan Area. The 2010-2011 NYMTC Regional Household Travel Survey and 2010 ACS 5-year estimate data will be used to establish multinomial logit models to interpret the effects. From model results, both high population density and job density help to encourage more public transport trips. The effects of population density and job density vary only by trip purpose. Other socioeconomic and trip-based variables also play a significant role in mode choice decisions.
Transportation planning, Urban planning, Statistics
yt2417
Urban Planning
Master's theses

College students’ time use and labor market plans
http://academiccommons.columbia.edu/catalog/ac:186326
Werbin, Gregoryhttp://dx.doi.org/10.7916/D8F47N8JWed, 27 May 2015 15:52:13 +0000I examine the patterns of association between college students’ time use and their senior-year labor market expectations. Using data from the National Longitudinal Survey of Freshmen, I investigate the relationship between reported time use and students’ plans after graduation. Specifically, I consider three labor market outcomes: whether students intend to work full-time after graduating (regardless of field), whether they intend to start working in a job (full- or part-time) that is a step in a desired career, and whether they apply to at least one graduate school. The problem reduces to determining which time use components are associated with each outcome, and then quantifying the relative strengths of those associations. Using elastic-net penalized regression for variable selection, I find that the activity most negatively associated with full-time job plans is time spent in class, while socially-oriented activities are the strongest positive predictors. This result can be explained by the inverse relationship between full-time job plans and applying to graduate school.Statistics, Social research, Higher educationgw2286Quantitative Methods in the Social Sciences, Economics (Barnard College)Master's thesesEfficiency in Lung Transplant Allocation Strategies
http://academiccommons.columbia.edu/catalog/ac:187899
Zou, Jingjinghttp://dx.doi.org/10.7916/D8QV3KKZTue, 12 May 2015 18:28:18 +0000Currently in the United States, lungs are allocated to transplant candidates based on the Lung Allocation Score (LAS). The LAS is an empirically derived score aimed at increasing total life span pre- and post-transplantation, for patients on lung transplant waiting lists. The goal here is to develop efficient allocation strategies in the context of lung transplantation.
In this study, patient and organ arrivals to the waiting list are modeled as independent homogeneous Poisson processes. Patients' health statuses prior to allocation are modeled as evolving according to independent and identically distributed finite-state inhomogeneous Markov processes, in which death is treated as an absorbing state. The expected post-transplantation residual life is modeled as depending on time on the waiting list and on current health status. For allocation strategies satisfying certain minimal fairness requirements, the long-term limit of expected average total life exists, and is used as the standard for comparing allocation strategies.
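As a rough illustration of the arrival model described above (not the dissertation's own code), independent homogeneous Poisson arrival streams can be simulated by summing exponential inter-arrival gaps. The function name, the rates, and the time horizon below are hypothetical values chosen only for the sketch.

```python
import random


def poisson_arrival_times(rate, horizon, rng):
    """Return sorted arrival times of a homogeneous Poisson process on [0, horizon).

    Inter-arrival gaps of a homogeneous Poisson process with intensity `rate`
    are independent Exponential(rate) variables, so we accumulate them.
    """
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# Hypothetical parameters: 2 patients and 1 organ per unit time on average.
rng = random.Random(0)
patient_arrivals = poisson_arrival_times(rate=2.0, horizon=1000.0, rng=rng)
organ_arrivals = poisson_arrival_times(rate=1.0, horizon=1000.0, rng=rng)

# The ratio of organ to patient arrivals should be near 1.0 / 2.0 here.
print(len(organ_arrivals) / len(patient_arrivals))
```

The two streams are drawn separately, which is one simple way to keep them independent, as the model assumes.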
Via the Hamilton-Jacobi-Bellman equations, upper bounds as a function of the ratio of organ arrival rate to the patient arrival rate for the long-term expected average total life are derived, and corresponding to each upper bound is an allocable set of (state, time) pairs at which patients would be optimally transplanted. As availability of organs increases, the allocable set expands monotonically, and ranking members of the waiting list according to the availability at which they enter the allocable set provides an allocation strategy that leads to long-term expected average total life close to the upper bound.
Simulation studies are conducted with model parameters estimated from national lung transplantation data from the United Network for Organ Sharing (UNOS). Results suggest that compared to the LAS, the proposed allocation strategy could provide a 7% increase in average total life.Statisticsjz2335StatisticsDissertationsExtreme Storm Surge Hazard Estimation and Windstorm Vulnerability Assessment for Quantitative Risk Analysis
http://academiccommons.columbia.edu/catalog/ac:186995
Lopeman, Madeleine Elisehttp://dx.doi.org/10.7916/D8BC3XNRThu, 07 May 2015 00:24:18 +0000Quantification of risk from natural disasters is a valuable endeavor from engineering, policy and (re)insurance perspectives. This work presents two research efforts relating to meteorological risk, specifically with regard to storm surge hazard estimation and wind vulnerability assessment.
While many high water level hazard estimation methods have been presented in the literature and used in industry applications, none bases its results on disaggregated tidal gauge data while also capturing the effects of the evolution of storm surge over the duration of a storm. Additionally, the coastal destruction wreaked by Hurricane Sandy in 2012 motivated an estimate of the event’s return period. To that end, this dissertation first presents the motivation for and development of the clustered separated peaks-over-threshold simulation (CSPS) method, a novel approach to the estimation of high water level return periods at coastal locations. The CSPS uses a Monte Carlo simulation of storm surge activity based on statistics derived from tidal gauge data. The data are separated into three independent components (storm surge, tidal cycle and sea level rise) because different physical processes govern different components of water level. Peak storm surge heights are fit to the generalized Pareto distribution, chosen for its ability to fit a wide tail to limited data, and a clustering algorithm incorporates the evolution of storm surge over surge duration. Confidence intervals on the return period estimates are computed by applying the bootstrapping method to the storm surge data.
Two case studies demonstrate the application of the CSPS to coastal tidal gauge data. First, the CSPS is applied to tidal gauge data from lower Manhattan. The results suggest that the return period of Hurricane Sandy’s peak water level is 103 years (95% confidence interval 38–452 years). That the CSPS estimate is significantly lower than previously published return periods indicates that storm surge hazard in the New York Harbor has, until now, been underestimated. The CSPS is also applied to all tidal gauge stations managed by the National Oceanic and Atmospheric Administration (NOAA) for which the hourly water level time histories are at least 30 years long. Comparison to NOAA’s exceedance probability levels for these stations suggests that the CSPS estimates higher return levels than NOAA, but also that the NOAA values fall within the 95% CI from the CSPS for more than half of the stations tested.
This dissertation continues with a critical comparison of windstorm vulnerability models. The intent of this research is to provide a compendium of reference curves against which to compare damage curves used in the reinsurance industry. The models tend to represent specific types of construction and use varying characteristic wind speed measurements to represent storm intensity. Wind speed conversion methods are used to harmonize wind speed scales. The different vulnerability models analyzed stem from different datasets and hypotheses, thus rendering them relevant to certain geographies or structural typologies. The resulting collection of comparable windstorm vulnerability models can serve as a reference framework against which damage curves from catastrophe risk models can be evaluated.Civil engineering, Hydrologic sciences, StatisticsCivil Engineering and Engineering MechanicsDissertationsStatistical Searches for Microlensing Events in Large, Non-Uniformly Sampled Time-Domain Surveys: A Test Using Palomar Transient Factory Data
http://academiccommons.columbia.edu/catalog/ac:185413
Price-Whelan, Adrian Michael; Agüeros, Marcel Andre; Fournier, Amanda P.; Street, Rachel; Ofek, Eran O.; Covey, Kevin R.; Levitan, David; Laher, Russ R.; Sesar, Branimir; Surace, Jasonhttp://dx.doi.org/10.7916/D8HM57CMFri, 03 Apr 2015 00:00:00 +0000Many photometric time-domain surveys are driven by specific goals, such as searches for supernovae or transiting exoplanets, which set the cadence with which fields are re-imaged. In the case of the Palomar Transient Factory (PTF), several sub-surveys are conducted in parallel, leading to non-uniform sampling over its ~20,000 deg² footprint. While the median 7.26 deg² PTF field has been imaged ~40 times in the R band, ~2300 deg² have been observed >100 times. We use PTF data to study the trade-off between searching for microlensing events in a survey whose footprint is much larger than that of typical microlensing searches, but with far-from-optimal time sampling. To examine the probability that microlensing events can be recovered in these data, we test statistics used on uniformly sampled data to identify variables and transients. We find that the von Neumann ratio performs best for identifying simulated microlensing events in our data. We develop a selection method using this statistic and apply it to data from fields with >10 R-band observations, 1.1 × 10⁹ light curves, uncovering three candidate microlensing events. We lack simultaneous, multi-color photometry to confirm these as microlensing events. However, their number is consistent with predictions for the event rate in the PTF footprint over the survey's three years of operations, as estimated from near-field microlensing models. 
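For readers unfamiliar with the von Neumann ratio mentioned in this abstract, the following is a minimal sketch of its textbook definition: the mean squared successive difference divided by the sample variance. It is not the authors' pipeline; the function name is hypothetical, and the sketch ignores observation times and measurement errors, which a real light-curve analysis would likely need to handle.

```python
def von_neumann_ratio(values):
    """Mean squared successive difference divided by the sample variance.

    For uncorrelated noise the ratio is close to 2; smooth, correlated
    variation (such as a slow brightening event) pushes it well below 2.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    if variance == 0:
        raise ValueError("constant series has zero variance")
    mean_sq_diff = sum((b - a) ** 2 for a, b in zip(values, values[1:])) / (n - 1)
    return mean_sq_diff / variance


# A smooth trend gives a small ratio; an alternating series gives a large one.
print(von_neumann_ratio([1.0, 2.0, 3.0, 4.0, 5.0]))
print(von_neumann_ratio([1.0, -1.0, 1.0, -1.0]))
```

Under this definition a light curve dominated by a smooth event would stand out by having a ratio far below the value expected for pure noise.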
This work can help constrain all-sky event rate predictions and tests microlensing signal recovery in large data sets, which will be useful to future time-domain surveys, such as that planned with the Large Synoptic Survey Telescope.Astronomy, Statisticsamp2217, maa17AstronomyArticlesGLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks
http://academiccommons.columbia.edu/catalog/ac:185410
He, Ran; Zheng, Tianhttp://dx.doi.org/10.7916/D8S46QVQThu, 02 Apr 2015 00:00:00 +0000Large networks, as a form of big data, have received an increasing amount of attention in data science, especially large social networks, which are reaching sizes of hundreds of millions, with daily interactions on the scale of billions. Thus analyzing and modeling these data to understand the connectivities and dynamics of large networks is important in a wide range of scientific fields. Among popular models, exponential random graph models (ERGMs) have been developed to study these complex networks by directly modeling network structures and features. ERGMs, however, are hard to scale to large networks because maximum likelihood estimation of parameters in these models can be very difficult, due to the unknown normalizing constant. Alternative strategies based on Markov chain Monte Carlo (MCMC) draw samples to approximate the likelihood, which is then maximized to obtain the maximum likelihood estimators (MLE). These strategies have poor convergence due to model degeneracy issues and cannot be used on large networks. Chatterjee et al. (Ann Stat 41:2428–2461, 2013) propose a new theoretical framework for estimating the parameters of ERGMs by approximating the normalizing constant using emerging tools from graph theory, namely graph limits. In this paper, we construct a complete computational procedure built upon their results, with practical innovations, that is fast and able to scale to large networks. More specifically, we evaluate the likelihood via simple function approximation of the corresponding ERGM’s graph limit and iteratively maximize the likelihood to obtain the MLE. We also discuss methods for conducting likelihood ratio tests for ERGMs as well as related issues. Through simulation studies and real data analysis of two large social networks, we show that our new method outperforms the MCMC-based method, especially when the network size is large (more than 100 nodes). 
One limitation of our approach, inherited from the limitation of the result of Chatterjee et al. (Ann Stat 41:2428–2461, 2013), is that it works only for sequences of graphs with a positive limiting density, i.e., dense graphs.Statisticsrh2528, tz33StatisticsArticlesA Practical Guide to Measuring Social Structure Using Indirectly Observed Network Data
http://academiccommons.columbia.edu/catalog/ac:185370
McCormick, Tyler H.; Moussa, Amal; DiPrete, Thomas A.; Ruf, Johannes; Gelman, Andrew E.; Teitler, Julien O.; Zheng, Tianhttp://dx.doi.org/10.7916/D86H4G9DTue, 31 Mar 2015 00:00:00 +0000Aggregated relational data (ARD) are an increasingly common tool for learning about social networks through standard surveys. Recent statistical advances present social scientists with new options for analyzing such data. In this article, we propose guidelines for learning about various network processes using ARD and a template to aid practitioners. We first propose that ARD can be used to measure “social distance” between a respondent and a subpopulation (individuals named Kevin, those in prison, or those serving in the military). We then present common methods for analyzing these data and associate each of these methods with a specific way of measuring social distance, thus associating statistical tools with their underlying social science phenomena. We examine the implications of using each of these social distance measures using an Internet survey about contemporary political issues.Statistics, Social researchtad61, ag389, jot8, tz33Sociology, Statistics, Social WorkArticlesHow many people do you know?: Efficiently estimating personal network size
http://academiccommons.columbia.edu/catalog/ac:185367
Zheng, Tian; Salganik, Matthew J.; McCormick, Tyler H.http://dx.doi.org/10.7916/D8FX78BTTue, 31 Mar 2015 00:00:00 +0000In this paper we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. and other previous attempts to estimate individual network size, we propose a latent non-random mixing model which resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names to be asked about are chosen properly, the simple scale-up degree estimates can enjoy the same bias reduction as that from our more complex latent non-random mixing model.Statistics, Social researchtz33StatisticsArticlesSurveying Hard-to-Reach Groups Through Sampled Respondents in a Social Network
http://academiccommons.columbia.edu/catalog/ac:185373
McCormick, Tyler H.; Zheng, Tian; He, Ran; Kolaczyk, Erichttp://dx.doi.org/10.7916/D8Z0372NTue, 31 Mar 2015 00:00:00 +0000The sampling frame in most social science surveys misses members of certain groups, such as the homeless or individuals living with HIV. These groups are known as hard-to-reach groups. One strategy for learning about these groups, or subpopulations, involves reaching hard-to-reach group members through their social network. In this paper we compare the efficiency of two common methods for subpopulation size estimation using data from standard surveys. These designs are examples of mental link tracing designs. These designs begin with a randomly sampled set of network members (nodes) and then reach other nodes indirectly through questions asked to the sampled nodes. Mental link tracing designs cost significantly less than traditional link tracing designs, yet introduce additional sources of potential bias. We examine the influence of one such source of bias using simulation studies. We then demonstrate our findings using data from the General Social Survey collected in 2004 and 2006. Additionally, we provide survey design suggestions for future surveys incorporating such designs.Statistics, Social researchtz33, rh2528StatisticsArticlesBackward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs
http://academiccommons.columbia.edu/catalog/ac:185325
Zheng, Tian; Wang, Hui; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D8SF2V33Mon, 30 Mar 2015 00:00:00 +0000Background: Studies of complex traits pose new challenges to current methods that evaluate association between genotypes and a specific trait. Consideration of possible interactions among loci leads to overwhelming dimensions that cannot be handled using current statistical methods. Methods: In this article, we evaluate a multi-marker screening algorithm--the backward genotype-trait association (BGTA) algorithm for case-control designs, which uses unphased multi-locus genotypes. BGTA carries out a global investigation on a candidate marker set and automatically screens out markers carrying diminutive amounts of information regarding the trait in question. To address the "too many possible genotypes, too few informative chromosomes" dilemma of a genomic-scale study that consists of hundreds to thousands of markers, we further investigate a BGTA-based marker selection procedure, in which the screening algorithm is repeated on a large number of random marker subsets. Results of these screenings are then aggregated into counts of how often each marker is retained by the BGTA algorithm. Markers with exceptionally high return counts are selected for further analysis. Results and Conclusion: Evaluated using simulations under several disease models, the proposed methods prove to be more powerful in dealing with epistatic traits. We also demonstrate the proposed methods through an application to a study on inflammatory bowel disease.Statistics, Genetics, Biostatisticstz33, hw2334, shl5Microbiology and Immunology, Statistics, BiostatisticsArticlesHow Many People Do You Know in Prison? Using Overdispersion in Count Data to Estimate Social Structure in Networks
http://academiccommons.columbia.edu/catalog/ac:185364
Zheng, Tian; Salganik, Matthew J.; Gelman, Andrew E.http://dx.doi.org/10.7916/D800011WMon, 30 Mar 2015 00:00:00 +0000Networks—sets of objects connected by relationships—are important in a number of fields. The study of networks has long been central to sociology, where researchers have attempted to understand the causes and consequences of the structure of relationships in large groups of people. Using insight from previous network research, Killworth et al. and McCarty et al. have developed and evaluated a method for estimating the sizes of hard-to-count populations using network data collected from a simple random sample of Americans. In this article we show how, using a multilevel overdispersed Poisson regression model, these data also can be used to estimate aspects of social structure in the population. Our work goes beyond most previous research on networks by using variation, as well as average responses, as a source of information. We apply our method to the data of McCarty et al. and find that Americans vary greatly in their number of acquaintances. Further, Americans show great variation in propensity to form ties to people in some groups (e.g., males in prison, the homeless, and American Indians), but little variation for other groups (e.g., twins, people named Michael or Nicole). We also explore other features of these data and consider ways in which survey data can be used to estimate network structure.Statistics, Social researchtz33, ag389Political Science, StatisticsArticlesComment: Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies
http://academiccommons.columbia.edu/catalog/ac:184983
Zheng, Tian; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D84T6H8MSat, 28 Mar 2015 00:00:00 +0000The authors suggest an interesting way to measure the fraction of missing information in the context of hypothesis testing. The measure seeks to quantify the impact of missing observations on the test between two hypotheses. The amount of impact can be useful information for applied research. An example arises in genetics, where multiple tests of the same sort are performed on different variables with different missing rates, and follow-up studies may be designed to resolve missing values in selected variables. In this discussion, we offer our prospective views on the use of relative information in a follow-up study. For studies where the impact of missing observations varies greatly across different variables and where the investigators have the flexibility of designing studies that can have different efforts on variables, an optimal design may be derived using relative information measures to improve the cost-effectiveness of the follow-up.Statisticstz33, shl5StatisticsArticlesDiscovering influential variables: A method of partitions
http://academiccommons.columbia.edu/catalog/ac:184953
Chernoff, Herman; Lo, Shaw-Hwa; Zheng, Tianhttp://dx.doi.org/10.7916/D8PR7TVMFri, 27 Mar 2015 00:00:00 +0000A trend in all scientific disciplines, based on advances in technology, is the increasing availability of high-dimensional data in which important information is buried. A current urgent challenge to statisticians is to develop effective methods of finding the useful information from the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer intensive approach, based on a method pioneered by Lo and Zheng for detecting which, of many potential explanatory variables, have an influence on a dependent variable Y. This approach is suited to detect influential variables, where causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis, involving possibly thousands of variables, by dealing with many randomly selected small subsets from which smaller subsets are selected, guided by a measure of influence I. The main objective is to discover the influential variables, rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be vulnerable to appropriate analysis. In a sense, we are confining our attention to locating a few needles in a haystack.Statistics, Computer scienceshl5, tz33StatisticsArticlesLatent demographic profile estimation in hard-to-reach groups
http://academiccommons.columbia.edu/catalog/ac:184956
McCormick, Tyler H.; Zheng, Tianhttp://dx.doi.org/10.7916/D8F76BFQFri, 27 Mar 2015 00:00:00 +0000The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using Aggregated relational data (ARD), or questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28–39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.Statisticstz33StatisticsArticlesOn Bootstrap Tests of Symmetry About an Unknown Median
http://academiccommons.columbia.edu/catalog/ac:184965
Zheng, Tian; Gastwirth, Joseph L.http://dx.doi.org/10.7916/D8X9296PFri, 27 Mar 2015 00:00:00 +0000It is important to examine the symmetry of an underlying distribution before applying some statistical procedures to a data set. For example, in the Zuni School District case, a formula originally developed by the Department of Education trimmed 5% of the data symmetrically from each end. The validity of this procedure was questioned at the hearing by Chief Justice Roberts. Most tests of symmetry (even nonparametric ones) are not distribution free in finite sample sizes. Hence, using the asymptotic distribution may yield an inaccurate type I error rate and/or a loss of power in small samples. Bootstrap resampling from a symmetric empirical distribution function fitted to the data is proposed to improve the accuracy of the calculated p-value of several tests of symmetry. The results show that the bootstrap method is superior to previously used approaches relying on the asymptotic distribution of the tests that assumed the data come from a normal distribution. Incorporating the bootstrap estimate in a recently proposed test due to Miao, Gel and Gastwirth (2006) preserves its level and yields reasonable power properties on the family of distributions evaluated.Statisticstz33StatisticsArticlesPolymorphisms in the Mitochondrial DNA Control Region and Frailty in Older Adults
http://academiccommons.columbia.edu/catalog/ac:184807
Moore, Anne Z.; Biggs, Mary L.; O'Connor, Ashley; Matteini, Amy; McGuire, Sarah; Beamer, Brock A.; Fallin, M. Danielle; Waltson, Jeremy; Fried, Linda P. ; Chakravarti, Aravinda; Arking, Dan E.http://dx.doi.org/10.7916/D83R0RRHTue, 24 Mar 2015 00:00:00 +0000Background: Mitochondria contribute to the dynamics of cellular metabolism, the production of reactive oxygen species, and apoptotic pathways. Consequently, mitochondrial function has been hypothesized to influence functional decline and vulnerability to disease in later life. Mitochondrial genetic variation may contribute to altered susceptibility to the frailty syndrome in older adults. Methodology/Principal Findings: To assess potential mitochondrial genetic contributions to the likelihood of frailty, mitochondrial DNA (mtDNA) variation was compared in frail and non-frail older adults. Associations of selected SNPs with a muscle strength phenotype were also explored. Participants were selected from the Cardiovascular Health Study (CHS), a population-based observational study (1989–1990, 1992–1993). At baseline, frailty was identified as the presence of three or more of five indicators (weakness, slowness, shrinking, low physical activity, and exhaustion). mtDNA variation was assessed in a pilot study, including 315 individuals selected as extremes of the frailty phenotype, using an oligonucleotide sequencing microarray based on the Revised Cambridge Reference Sequence. Three mtDNA SNPs were statistically significantly associated with frailty across all pilot participants or in sex-stratified comparisons: mt146, mt204, and mt228. In addition to pilot participants, 4,459 additional men and women with frailty classifications, and an overlapping subset of 4,453 individuals with grip strength measurements, were included in the study population genotyped at mt204 and mt228. 
In the study population, the mt204 C allele was associated with greater likelihood of frailty (adjusted odds ratio = 2.04, 95% CI = 1.07–3.60, p = 0.020) and lower grip strength (adjusted coefficient = −2.04, 95% CI = −3.33– −0.74, p = 0.002). Conclusions: This study supports a role for mitochondrial genetic variation in the frailty syndrome and later life muscle strength, demonstrating the importance of the mitochondrial genome in complex geriatric phenotypes.Genetics, Medicine, Statisticslf2296Mailman School of Public HealthArticlesBiodiversity and Ecosystem Multi-Functionality: Observed Relationships in Smallholder Fallows in Western Kenya
http://academiccommons.columbia.edu/catalog/ac:184813
Sircely, Jason; Naeem, Shahidhttp://dx.doi.org/10.7916/D8V986XHTue, 24 Mar 2015 00:00:00 +0000Recent studies indicate that species richness can enhance the ability of plant assemblages to support multiple ecosystem functions. To understand how and when ecosystem services depend on biodiversity, it is valuable to expand beyond experimental grasslands. We examined whether plant diversity improves the capacity of agroecosystems to sustain multiple ecosystem services—production of wood and forage, and two elements of soil formation—in two types of smallholder fallows in western Kenya. In 18 grazed and 21 improved fallows, we estimated biomass and quantified soil organic carbon, soil base cations, sand content, and soil infiltration capacity. For four ecosystem functions (wood biomass, forage biomass, soil base cations, steady infiltration rates) linked to the focal ecosystem services, we quantified ecosystem service multi-functionality as (1) the proportion of functions above half-maximum, and (2) mean percentage excess above mean function values, and assessed whether plant diversity or environmental favorability better predicted multi-functionality. In grazed fallows, positive effects of plant diversity best explained the proportion above half-maximum and mean percentage excess, the former also declining with grazing intensity. In improved fallows, the proportion above half-maximum was not associated with soil carbon or plant diversity, while soil carbon predicted mean percentage excess better than diversity. Grazed fallows yielded stronger evidence for diversity effects on multi-functionality, while environmental conditions appeared more influential in improved fallows. The contrast in diversity-multi-functionality relationships among fallow types appears related to differences in management and associated factors including disturbance and species composition. 
Complementary effects of species with contrasting functional traits on different functions and multi-functional species may have contributed to diversity effects in grazed fallows. Biodiversity and environmental favorability may enhance the capacity of smallholder fallows to simultaneously provide multiple ecosystem services, yet their effects are likely to vary with fallow management.Ecology, Environmental studies, Statisticssn2121Ecology, Evolution, and Environmental BiologyArticlesOn Identifying Rare Variants for Complex Human Traits
http://academiccommons.columbia.edu/catalog/ac:197118
Fan, Ruixuehttp://dx.doi.org/10.7916/D8N29VT4Mon, 16 Mar 2015 00:00:00 +0000This thesis focuses on developing novel statistical tests for rare variants association analysis incorporating both marginal effects and interaction effects among rare variants. Compared with common variants, rare variants have lower minor allele frequencies (typically less than 5%), and hence traditional association tests for common variants will lose power for rare variants. Therefore, there is a pressing need for new analytical tools to tackle the problem of rare variants association with complex human traits. Several collapsing methods have been proposed that aggregate information of rare variants in a region and test them together. They can be divided into burden tests and non-burden tests based on their aggregation strategies. They are all variations of regression-based methods with the assumption that the phenotype is associated with the genotype via a (linear) regression model. Most of these methods consider only marginal effects of rare variants and fail to take into account gene-gene and gene-environmental interactive effects, which are ubiquitous and are of utmost importance in biological systems. In this thesis, we propose a summation of partition approach (SPA) -- a nonparametric strategy for rare variants association analysis. Extensive simulation studies show that SPA is powerful in detecting not only marginal effects but also gene-gene interaction effects of rare variants. Moreover, extensions of SPA are able to detect gene-environment interactions and other interactions existing in complicated biological systems as well. We are also able to obtain the asymptotic behavior of the marginal SPA score, which guarantees the power of the proposed method. Inspired by the idea of stepwise variable selection, a significance-based backward dropping algorithm (SDA) is proposed to locate truly influential rare variants in a genetic region that has been identified as significant. 
Unlike traditional backward dropping approaches which remove the least significant variables first, SDA introduces the idea of eliminating the most significant variable at each round. The removed variables are collected and their effects are evaluated by an influence ratio score -- the relative p-value change. Our simulation studies show that SDA is powerful in detecting causal variables and has a lower false discovery rate than LASSO. We also demonstrate our method using the dataset provided by Genetic Analysis Workshop (GAW) 17 and the results support the superiority of SDA over LASSO. The general partition-retention framework can also be applied to detect gene-environmental interaction effects for common variants. We demonstrate this method using the dataset from Genetic Analysis Workshop (GAW) 18. Our nonparametric approach is able to identify many more possibly influential gene-environmental pairs than traditional linear regression models. We propose in this thesis a "SPA-SDA" two-step approach for rare variants association analysis at genomic scale: first identify significant regions of moderate sizes using SPA, and then apply SDA to the identified regions to pinpoint truly influential variables. This approach is computationally efficient for genomic data and it has the capacity to detect gene-gene and gene-environmental interactions.Statistics, Bioinformatics, Human genetics--Variation, Regression analysis, Genetics--Statistical methods, Genomics--Data Processingrf2283StatisticsDissertationsEstimating Preferences under Risk: The Case of Racetrack Bettors
http://academiccommons.columbia.edu/catalog/ac:184178
Jullien, Bruno; Salanie, Bernardhttp://dx.doi.org/10.7916/D8S75F6JTue, 10 Mar 2015 00:00:00 +0000In this paper we investigate the attitudes toward risk of bettors in British horse races. The model we use allows us to go beyond the expected utility framework and to explore various alternative proposals by estimating a multinomial model on a 34,443‐race data set. We find that rank‐dependent utility models do not fit the data noticeably better than expected utility models. On the other hand, cumulative prospect theory has higher explanatory power. Our preferred estimates suggest a pattern of local risk aversion similar to that proposed by Friedman and Savage.Economics, Economic theory, Statisticsbs2237EconomicsArticlesEffect of Childhood Victimization on Occupational Prestige and Income Trajectories
http://academiccommons.columbia.edu/catalog/ac:184166
Christ, Sharon L.; Fernandez, Cristina A. ; LeBlanc, William G.; McCollister, Kathyrn E.; Arheart, Kristopher L.; Dietz, Noella A.; Fleming, Lora E.; Muntaner, Carles ; Muennig, Peter A.; Lee, David J.http://dx.doi.org/10.7916/D88C9V3DFri, 06 Mar 2015 00:00:00 +0000Background Violence toward children (childhood victimization) is a major public health problem, with long-term consequences on economic well-being. The purpose of this study was to determine whether childhood victimization affects occupational prestige and income in young adulthood. We hypothesized that young adults who experienced more childhood victimizations would have less prestigious jobs and lower incomes relative to those with no victimization history. We also explored the pathways in which childhood victimization mediates the relationships between background variables, such as parent’s educational impact on the socioeconomic transition into adulthood. Methods A nationally representative sample of 8,901 young adults aged 18–28 surveyed between 1999–2009 from the National Longitudinal Survey of Youth 1997 (NLSY) were analyzed. Covariate-adjusted multivariate linear regression and path models were used to estimate the effects of victimization and covariates on income and prestige levels and on income and prestige trajectories. After each participant turned 18, their annual 2002 Census job code was assigned a yearly prestige score based on the 1989 General Social Survey, and their annual income was calculated via self-reports. Occupational prestige and annual income are time-varying variables measured from 1999–2009. Victimization effects were tested for moderation by sex, race, and ethnicity in the multivariate models. Results Approximately half of our sample reported at least one instance of childhood victimization before the age of 18. 
Major findings include 1) childhood victimization resulted in slower income and prestige growth over time, and 2) mediation analyses suggested that this slower growth in prestige and earnings arose because victims did not get the same amount of education as non-victims. Conclusions Results indicated that the consequences of victimization negatively affected economic success throughout young adulthood, primarily by slowing the growth in prosperity due to lower education levels.Public health, Sociology, Statisticspm124Health Policy and ManagementArticlesIncrease in Diarrheal Disease Associated with Arsenic Mitigation in Bangladesh
http://academiccommons.columbia.edu/catalog/ac:184172
Wu, Jianyong; Jahangir Alam, Yasuyuki Akita; van Geen, Alexander; Ahmed, Kazi Matin; Culligan, Patricia J.; Escamilla, Veronica; Feighery, John; Ferguson, Andrew S.; Knappett, Peter; Mailloux, Brian Justin; McKay, Larry D.; Serre, Marc L. ; Streatfield, P. Kim; Yunus, Mohammad; Emch, Michael http://dx.doi.org/10.7916/D87D2T01Fri, 06 Mar 2015 00:00:00 +0000Background Millions of households throughout Bangladesh have been exposed to high levels of arsenic (As) causing various deadly diseases by drinking groundwater from shallow tubewells for the past 30 years. Well testing has been the most effective form of mitigation because it has induced massive switching from tubewells that are high (>50 µg/L) in As to neighboring wells that are low in As. A recent study has shown, however, that shallow low-As wells are more likely to be contaminated with the fecal indicator E. coli than shallow high-As wells, suggesting that well switching might lead to an increase in diarrheal disease. Methods Approximately 60,000 episodes of childhood diarrhea were collected monthly by community health workers between 2000 and 2006 in 142 villages of Matlab, Bangladesh. In this cross-sectional study, associations between childhood diarrhea and As levels in tubewell water were evaluated using logistic regression models. Results Adjusting for wealth, population density, and flood control by multivariate logistic regression, the model indicates an 11% (95% confidence intervals (CIs) of 4–19%) increase in the likelihood of diarrhea in children drinking from shallow wells with 10–50 µg/L As compared to shallow wells with >50 µg/L As. The same model indicates a 26% (95%CI: 9–42%) increase in diarrhea for children drinking from shallow wells with ≤10 µg/L As compared to shallow wells with >50 µg/L As. Conclusion Children drinking water from shallow low As wells had a higher prevalence of diarrhea than children drinking water from high As wells. 
This suggests that the health benefits of reducing As exposure may to some extent be countered by an increase in childhood diarrhea.Public health, Statisticspjc2104, bjm2103Civil Engineering and Engineering Mechanics, Environmental Science (Barnard College)ArticlesDynamical Phenotyping: Using Temporal Analysis of Clinically Collected Physiologic Data to Stratify Populations
http://academiccommons.columbia.edu/catalog/ac:184147
Albers, David J.; Elhadad, Noemie; Tabak, E.; Perotte, Adler; Hripcsak, George M.http://dx.doi.org/10.7916/D8W9581VFri, 06 Mar 2015 00:00:00 +0000Using glucose time series data from a well measured population drawn from an electronic health record (EHR) repository, the variation in predictability of glucose values quantified by the time-delayed mutual information (TDMI) was explained using a mechanistic endocrine model and manual and automated review of written patient records. The results suggest that predictability of glucose varies with health state where the relationship (e.g., linear or inverse) depends on the source of the acuity. It was found that on a fine scale in parameter variation, the less insulin required to process glucose, a condition that correlates with good health, the more predictable glucose values were. Nevertheless, the most powerful effect on predictability in the EHR subpopulation was the presence or absence of variation in health state, specifically, in- and out-of-control glucose versus in-control glucose. Both of these results are clinically and scientifically relevant because the magnitude of glucose is the most commonly used indicator of health as opposed to glucose dynamics, thus providing for a connection between a mechanistic endocrine model and direct insight to human health via clinically collected data.Medicine, Endocrinology, Statisticsdja2119, ne60, ajp2120, gh13Biomedical InformaticsArticlesPopulation Physiology: Leveraging Electronic Health Record Data to Understand Human Endocrine Dynamics
http://academiccommons.columbia.edu/catalog/ac:184150
Albers, David J.; Hripcsak, George M.; Schmidt, J. Michaelhttp://dx.doi.org/10.7916/D8KW5DWSFri, 06 Mar 2015 00:00:00 +0000Studying physiology and pathophysiology over a broad population for long periods of time is difficult primarily because collecting human physiologic data can be intrusive, dangerous, and expensive. One solution is to use data that have been collected for a different purpose. Electronic health record (EHR) data promise to support the development and testing of mechanistic physiologic models on diverse populations and allow correlation with clinical outcomes, but limitations in the data have thus far thwarted such use. For example, using uncontrolled population-scale EHR data to verify the outcome of time dependent behavior of mechanistic, constructive models can be difficult because: (i) aggregation of the population can obscure or generate a signal, (ii) there is often no control population with a well understood health state, and (iii) diversity in how the population is measured can make the data difficult to fit into conventional analysis techniques. This paper shows that it is possible to use EHR data to test a physiological model for a population and over long time scales. Specifically, a methodology is developed and demonstrated for testing a mechanistic, time-dependent, physiological model of serum glucose dynamics with uncontrolled, population-scale, physiological patient data extracted from an EHR repository. It is shown that there is no observable daily variation in the normalized mean glucose for any EHR subpopulation. In contrast, a derived value, daily variation in nonlinear correlation quantified by the time-delayed mutual information (TDMI), did reveal the intuitively expected diurnal variation in glucose levels amongst a random population of humans. Moreover, in a population of continuously (tube) fed patients, there was no observable TDMI-based diurnal signal.
These TDMI-based signals, via a glucose insulin model, were then connected with human feeding patterns. In particular, a constructive physiological model was shown to correctly predict the difference between the general uncontrolled population and a subpopulation whose feeding was controlled.Statistics, Medicinedja2119, gh13, mjs2134Biomedical Informatics, NeurologyArticlesSigns of the 2009 Influenza Pandemic in the New York-Presbyterian Hospital Electronic Health Records
http://academiccommons.columbia.edu/catalog/ac:184153
Khiabanian, Hossein; Holmes, Antony B.; Kelly, Brendan J.; Gururaj, Mrinalini; Hripcsak, George M.; Rabadan, Raulhttp://dx.doi.org/10.7916/D82V2F0DFri, 06 Mar 2015 00:00:00 +0000Background In June of 2009, the World Health Organization declared the first influenza pandemic of the 21st century, and by July, New York City's New York-Presbyterian Hospital (NYPH) experienced a heavy burden of cases, attributable to a novel strain of the virus (H1N1pdm). Methods and Results We present the signs in the NYPH electronic health records (EHR) that distinguished the 2009 pandemic from previous seasonal influenza outbreaks via various statistical analyses. These signs include (1) an increase in the number of patients diagnosed with influenza, (2) a preponderance of influenza diagnoses outside of the normal flu season, and (3) marked vaccine failure. The NYPH EHR also reveals distinct age distributions of patients affected by seasonal influenza and the pandemic strain, and via available longitudinal data, suggests that the two may be associated with distinct sets of comorbid conditions as well. In particular, we find significantly more pandemic flu patients with diagnoses associated with asthma and underlying lung disease. We further observe that the NYPH EHR is capable of tracking diseases at a resolution as high as particular zip codes in New York City. Conclusion The NYPH EHR permits early detection of pandemic influenza and hypothesis generation via identification of those significantly associated illnesses. As data standards develop and databases expand, EHRs will contribute more and more to disease detection and the discovery of novel disease associations.Medicine, Statistics, Public healthhk2524, abh2138, gh13, rr2579Biomedical InformaticsArticlesDrinking Patterns and Alcohol Use Disorders in São Paulo, Brazil: The Role of Neighborhood Social Deprivation and Socioeconomic Status
http://academiccommons.columbia.edu/catalog/ac:184775
Silveira, Camila Magalhaes; Siu, Erica Rosanna; Anthony, James C.; Saito, Luis Paulo; Guerra de Andrade, Arthur; Kutschenko, Andressa; Viana, Maria Carmen; Wang, Yuan-Pang; Martins, Silvia S.; Andrade, Laura Helenahttp://dx.doi.org/10.7916/D89C6W9PFri, 06 Mar 2015 00:00:00 +0000Background Research conducted in high-income countries has investigated influences of socioeconomic inequalities on drinking outcomes such as alcohol use disorders (AUD); however, associations between area-level neighborhood social deprivation (NSD) and individual socioeconomic status with these outcomes have not been explored in Brazil. Thus, we investigated the role of these factors on drink-related outcomes in a Brazilian population, attending to male-female variations. Methods A multi-stage area probability sample of adult household residents in the São Paulo Metropolitan Area was assessed using the WHO Composite International Diagnostic Interview (WMH-CIDI) (n = 5,037). Estimation focused on prevalence and correlates of past-year alcohol disturbances [heavy drinking of lower frequency (HDLF), heavy drinking of higher frequency (HDHF), abuse, dependence, and DSM-5 AUD] among regular users (RU); odds ratios (ORs) were obtained. Results Higher NSD, measured as an area-level variable with individual level variables held constant, showed excess odds for most alcohol disturbances analyzed. Prevalence estimates for HDLF and HDHF among RU were 9% and 20%, respectively, with excess odds in higher NSD areas; schooling (inverse association) and low income were associated with male HDLF. The only individual-level association with female HDLF involved employment status. Prevalence estimates for abuse, dependence, and DSM-5 AUD among RU were 8%, 4%, and 8%, respectively, with excess odds of: dependence in higher NSD areas for males; abuse and AUD for females.
Among RU, AUD was associated with unemployment, and low education with dependence and AUD.Public health, Social research, Statisticsssm2183EpidemiologyArticlesLearning Structure in Time Series for Neuroscience and Beyond
http://academiccommons.columbia.edu/catalog/ac:180952
Pfau, David Benjaminhttp://dx.doi.org/10.7916/D8WH2NRRThu, 04 Dec 2014 00:00:00 +0000Advances in neuroscience are producing data at an astounding rate - data which are fiendishly complex both to process and to interpret. Biological neural networks are high-dimensional, nonlinear, noisy, heterogeneous, and in nearly every way defy the simplifying assumptions of standard statistical methods. In this dissertation we address a number of issues with understanding the structure of neural populations, from the abstract level of how to uncover structure in generic time series, to the practical matter of finding relevant biological structure in state-of-the-art experimental techniques. To learn the structure of generic time series, we develop a new statistical model, which we dub the probabilistic deterministic infinite automata (PDIA), which uses tools from nonparametric Bayesian inference to learn a very general class of sequence models. We show that the models learned by the PDIA often offer better predictive performance and faster inference than Hidden Markov Models, while being significantly more compact than models that simply memorize contexts. For large populations of neurons, models like the PDIA become unwieldy, and we instead investigate ways to robustly reduce the dimensionality of the data. In particular, we adapt the generalized linear model (GLM) framework for regression to the case of matrix completion, which we call the low-dimensional GLM. We show that subspaces and dynamics of neural activity can be accurately recovered from model data, and that a model with only minimal assumptions about the structure of the dynamics can still lead to good predictive performance on real data. Finally, to bridge the gap between recording technology and analysis, particularly as recordings from ever-larger populations of neurons become the norm, automated methods for extracting activity from raw recordings become a necessity.
We present a number of methods for automatically segmenting biological units from optical imaging data, with applications to light sheet recording of genetically encoded calcium indicator fluorescence in the larval zebrafish, and optical electrophysiology using genetically encoded voltage indicators in culture. Together, these methods are a powerful set of tools for addressing the diverse challenges of modern neuroscience.Neurosciences, Statisticsdbp2112Neurobiology and BehaviorDissertationsMethods for handling measurement error and sources of variation in functional data models
http://academiccommons.columbia.edu/catalog/ac:191573
Cai, Xiaochenhttp://dx.doi.org/10.7916/D8M907CJFri, 21 Nov 2014 00:00:00 +0000The overall theme of this thesis work concerns the problem of handling measurement error and sources of variation in functional data models. The first part introduces a wavelet-based sparse principal component analysis approach for characterizing the variability of multilevel functional data that are characterized by spatial heterogeneity and local features. The total covariance of the data can be decomposed into three hierarchical levels: between subjects, between sessions and measurement error. Sparse principal component analysis in the wavelet domain allows for reducing dimension and deriving main directions of random effects that may vary for each hierarchical level. The method is illustrated by application to data from a study of human vision. The second part considers the problem of scalar-on-function regression when the functional regressors are observed with measurement error. We develop a simulation-extrapolation method for scalar-on-function regression, which first estimates the error variance, establishes the relationship between a sequence of added error variance and the corresponding estimates of coefficient functions, and then extrapolates to the zero-error. We introduce three methods to extrapolate the sequence of estimated coefficient functions. In a simulation study, we compare the performance of the simulation-extrapolation method with two pre-smoothing methods based on smoothing splines and functional principal component analysis. The third part discusses several extensions of the simulation-extrapolation method developed in the second part. Some of the extensions are illustrated by application to diffusion tensor imaging data.Biostatistics, Statistics, Biometry, Error analysis (Mathematics), Analysis of covariancexc2214BiostatisticsDissertationsPreaching to the Unconverted
http://academiccommons.columbia.edu/catalog/ac:179470
Uriarte, Maria; Yackulic, Charles B.http://dx.doi.org/10.7916/D8SB44FMSun, 09 Nov 2014 00:00:00 +0000Rapid advances in computing in the past 20 years have led to an explosion in the development and adoption of new statistical modeling tools (Gelman and Hill 2006, Clark 2007, Bolker 2008, Cressie et al. 2009). These innovations have occurred in parallel with a tremendous increase in the availability of ecological data. The latter has been fueled both by new tools that have facilitated data collection and management efforts (e.g., remote sensing, database management software, and so on) and by increased ease of data sharing through computers and the World Wide Web. The impending implementation of the National Ecological Observatory Network (NEON) will further boost data availability. These rapid advances in the ability of ecologists to collect data have not been matched by application of modern statistical tools. Given the critical questions ecology is facing (e.g., climate change, species extinctions, spread of invasives, irreversible losses of ecosystem services) and the benefits that can be gained from connecting existing data to models in a sophisticated inferential framework (Clark et al. 2001, Pielke and Connant 2003), it is important to understand why this mismatch exists. Such an understanding would point to the issues that must be addressed if ecologists are to make useful inferences from these new data and tools and contribute in substantial ways to management and decision making.Ecology, Statisticsmu2126Ecology, Evolution, and Environmental BiologyArticlesSPAr package for Fan and Lo (2013) "A robust model-free approach for rare variants association studies incorporating gene-gene and gene-environmental interactions."
http://academiccommons.columbia.edu/catalog/ac:179424
Fan, Ruixue; Lo, Shaw-Hwahttp://dx.doi.org/10.7916/D84Q7SN6Fri, 07 Nov 2014 00:00:00 +0000Recently, more and more evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Secondly, it does not sacrifice the marginal detection power; in the situation when rare variants only have marginal effects it is comparable with the most competitive method in current literature. Thirdly, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can readily tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions. This package is also maintained on the Comprehensive R Archive Network (http://cran.r-project.org).
It contains the R programs, the user's manual, and example code.Genetics, Statisticsrf2283, shl5StatisticsComputer softwareSource codes for GLMLE algorithm
http://academiccommons.columbia.edu/catalog/ac:178966
Zheng, Tian; He, Ranhttp://dx.doi.org/10.7916/D8HH6HQRFri, 24 Oct 2014 00:00:00 +0000This is the R source code for the algorithm proposed for fitting exponential random graph models (ERGMs) on large social networks in our paper "Estimation of exponential random graph models for large social networks via graph limits". Specifically, the ERGM we implement is the one examined in that paper, which considers homomorphism densities of edges, two-stars, and triangles.Statistics, Computer sciencetz33, rh2528StatisticsComputer softwareMarkov Clustering on Person-to-Person Similarity Graph: Attribution of Movies’ Box Office Results to Preferences of Viewer Communities
http://academiccommons.columbia.edu/catalog/ac:177703
Tkachenko, Yegorhttp://dx.doi.org/10.7916/D87M06G5Mon, 29 Sep 2014 00:00:00 +0000The search for methods of deriving actionable marketing segmentation has a long history in the marketing literature. This work proposes the use of the Markov clustering algorithm on a person-to-person similarity graph, where similarity between individuals is based on the similarity of their rating assignments. This allows the detection of taste-based communities of users. Simple regression analysis is subsequently applied to detect the dependencies of box office results of movies of various genres on the preferences of specific viewer communities. The resulting analysis permitted identification of communities that drive box office results of specific movie genres.Business, Marketing, Statisticsit2206BusinessMaster's thesesUsing individual growth model to analyze the change in quality of life from adolescence to adulthood
http://academiccommons.columbia.edu/catalog/ac:192019
Chen, Henian; Cohen, Patricia R.http://dx.doi.org/10.7916/D8805135Tue, 09 Sep 2014 00:00:00 +0000Background: The individual growth model is a relatively new statistical technique now widely used to examine the unique trajectories of individuals and groups in repeated measures data. This technique is increasingly used to analyze the changes over time in quality of life (QOL) data. This study examines the change from adolescence to adulthood in physical health as an aspect of QOL as an illustration of the use of this analytic method. Methods: Employing data from the Children in the Community (CIC) study, a prospective longitudinal investigation, physical health was assessed at mean ages 16, 22, and 33 in 752 persons born between 1965 and 1975. Results: The analyses using individual growth models show a linear decline in average physical health from age 10 to age 40. Males reported better physical health and declined less per year on average. Time-varying psychiatric disorders accounted for 8.6% of the explained variation in mean physical health, and 6.7% of the explained variation in linear change in physical health. Those with such a disorder reported lower mean physical health and a more rapid decline with age than those without a current psychiatric disorder. Guidance on the use of SAS PROC MIXED, including syntax and interpretation of output, is provided. Applications of these models, including statistical assumptions, centering issues, and cohort effects, are discussed. Conclusion: This paper highlights the usefulness of the individual growth model in modeling longitudinal change in QOL variables.Health sciences, Aging, Statistics, Young adults--Health and hygiene, Human growth--Mathematical models, Quality of life--Statistical methodsprc2Epidemiology, PsychiatryArticlesBAMarray™: Java software for Bayesian analysis of variance for microarray data
http://academiccommons.columbia.edu/catalog/ac:192099
Ishwaran, Hemant; Rao, J. Sunil; Kogalur, Udaya B.http://dx.doi.org/10.7916/D8BR8QNZTue, 09 Sep 2014 00:00:00 +0000Background: DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important example is multigroup data collected over different experimental groups, such as data collected from distinct stages of a disease process. We have developed a method specifically addressing these issues termed Bayesian ANOVA for microarrays (BAM). The BAM approach uses a special inferential regularization known as spike-and-slab shrinkage that provides an optimal balance between total false detections and total false non-detections. This translates into more reproducible differential calls. Spike and slab shrinkage is a form of regularization achieved by using information across all genes and groups simultaneously. Results: BAMarray™ is a graphically oriented Java-based software package that implements the BAM method for detecting differentially expressing genes in multigroup microarray experiments (up to 256 experimental groups can be analyzed). Drop-down menus allow the user to easily select between different models and to choose various run options. BAMarray™ can also be operated in a fully automated mode with preselected run options. Tuning parameters have been preset at theoretically optimal values freeing the user from such specifications. BAMarray™ provides estimates for gene differential effects and automatically estimates data adaptive, optimal cutoff values for classifying genes into biological patterns of differential activity across experimental groups. 
A graphical suite is a core feature of the product and includes diagnostic plots for assessing model assumptions and interactive plots that enable tracking of prespecified gene lists to study such things as biological pathway perturbations. The user can zoom in and lasso genes of interest that can then be saved for downstream analyses. Conclusion: BAMarray™ is user-friendly, platform-independent software that effectively and efficiently implements the BAM methodology. Classifying patterns of differential activity is greatly facilitated by a data adaptive cutoff rule and a graphical suite. BAMarray™ is licensed software freely available to academic institutions. More information can be found at http://www.bamarray.com.Statistics, Information technology, Bioinformatics, Bayesian statistical decision theory, DNA microarrays--Data processing, Java (Computer program language), Bioinformaticsubk2101StatisticsArticlesPAGE: Parametric Analysis of Gene Set Enrichment
http://academiccommons.columbia.edu/catalog/ac:194039
Kim, Seon-Young; Volsky, David Julianhttp://dx.doi.org/10.7916/D84X568JTue, 09 Sep 2014 00:00:00 +0000Background: Gene set enrichment analysis (GSEA) is a microarray data analysis method that uses predefined gene sets and ranks of genes to identify significant biological changes in microarray data sets. GSEA is especially useful when gene expression changes in a given microarray data set are minimal or moderate. Results: We developed a modified gene set enrichment analysis method based on a parametric statistical analysis model. Compared with GSEA, the parametric analysis of gene set enrichment (PAGE) detected a larger number of significantly altered gene sets and their p-values were lower than the corresponding p-values calculated by GSEA. Because PAGE uses the normal distribution for statistical inference, it requires less computation than GSEA, which needs repeated computation on permuted data sets. PAGE was able to detect significantly changed gene sets from microarray data irrespective of different Affymetrix probe level analysis methods or different microarray platforms. Comparison of two aged muscle microarray data sets at gene set level using PAGE revealed common biological themes better than comparison at individual gene level. Conclusion: PAGE was statistically more sensitive and required much less computational effort than GSEA; it could identify significantly changed biological themes from microarray data irrespective of analysis methods or microarray platforms; and it was useful in comparison of multiple microarray data sets. We offer PAGE as a useful microarray analysis method.Bioinformatics, Biostatistics, Genetics, DNA microarrays--Data processing, Bioinformatics--Methodology, Statisticsdjv4Pathology and Cell BiologyArticlesNew insights into old methods for identifying causal rare variants
http://academiccommons.columbia.edu/catalog/ac:195277
Wang, Haitian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Zheng, Tian; Hu, Inchihttp://dx.doi.org/10.7916/D8J38R1MTue, 09 Sep 2014 00:00:00 +0000The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.Biostatistics, Statistics, Statistics--Methodology, Human genetics--Variation, Biometry--Statistical methodsshl5, tz33StatisticsArticlesCopy number variation genotyping using family information
http://academiccommons.columbia.edu/catalog/ac:180080
Chu, Jen-hwa; Rogers, Angela; Ionita-Laza, Iuliana; Darvishi, Katayoon; Mills, Ryan E.; Lee, Charles; Raby, Benjamin A.http://dx.doi.org/10.7916/D8HD7T0DMon, 08 Sep 2014 00:00:00 +0000Background: In recent years there has been a growing interest in the role of copy number variations (CNV) in genetic diseases. Though there has been rapid development of technologies and statistical methods devoted to the detection of CNVs from array data, the inherent data quality issues associated with most hybridization techniques remain a challenging problem in CNV association studies. Results: To help address these data quality issues in the context of family-based association studies, we introduce a statistical framework for intensity-based array data that takes into account family information for copy-number assignment. The method is an adaptation of traditional methods for modeling SNP genotype data that assume a Gaussian mixture model, whereby CNV calling is performed for all family members simultaneously, leveraging within-family data to reduce CNV calls that are incompatible with Mendelian inheritance while still allowing de novo CNVs. Applying this method to simulation studies and a genome-wide association study in asthma, we find that our approach significantly improves CNV call accuracy and reduces the Mendelian inconsistency rates and false positive genotype calls. The results were validated using qPCR experiments. Conclusions: We have demonstrated that the use of family information can improve the quality of CNV calling and may yield more powerful association tests of CNVs.Genetics, Statisticsii2135Mailman School of Public Health, BiostatisticsArticlesHelping the decision maker effectively promote various experts’ views into various optimal solutions to China’s institutional problem of health care provider selection.
http://academiccommons.columbia.edu/catalog/ac:180132
Tang, Liyanghttp://dx.doi.org/10.7916/D8BP0147Mon, 08 Sep 2014 00:00:00 +0000Background: The main aim of China’s Health Care System Reform was to help the decision maker find the optimal solution to China’s institutional problem of health care provider selection. A pilot health care provider research system was recently organized in China’s health care system, and it could efficiently collect data from various experts for determining the optimal solution to this problem. The purpose of this study was therefore to apply the optimal implementation methodology to help the decision maker effectively promote various experts’ views into various optimal solutions to this problem with the support of this pilot system. Methods: After the general framework of China’s institutional problem of health care provider selection was established, this study collaborated with the National Bureau of Statistics of China to commission a large-scale 2009 to 2010 national expert survey (n = 3,914) through the organization of a pilot health care provider research system for the first time in China, and the analytic network process (ANP) implementation methodology was adopted to analyze the dataset from this survey.
Results: The market-oriented health care provider approach was the optimal solution to China’s institutional problem of health care provider selection from the doctors’ point of view; the traditional government’s regulation-oriented health care provider approach was the optimal solution to China’s institutional problem of health care provider selection from the pharmacists’ point of view, the hospital administrators’ point of view, and the point of view of health officials in health administration departments; the public private partnership (PPP) approach was the optimal solution to China’s institutional problem of health care provider selection from the nurses’ point of view, the point of view of officials in medical insurance agencies, and the health care researchers’ point of view. Conclusions: The data collected through a pilot health care provider research system in the 2009 to 2010 national expert survey could help the decision maker effectively promote various experts’ views into various optimal solutions to China’s institutional problem of health care provider selection.Statistics, BusinessBusinessArticlesReporting of analyses from randomized controlled trials with multiple arms: a systematic review
http://academiccommons.columbia.edu/catalog/ac:180137
Baron, Gabriel; Perrodeau, Elodie; Boutron, Isabelle; Ravaud, Philippehttp://dx.doi.org/10.7916/D837772TMon, 08 Sep 2014 00:00:00 +0000Background: Multiple-arm randomized trials can be more complex in their design, data analysis, and result reporting than two-arm trials. We conducted a systematic review to assess the reporting of analyses in reports of randomized controlled trials (RCTs) with multiple arms. Methods: The literature in the MEDLINE database was searched for reports of RCTs with multiple arms published in 2009 in the core clinical journals. Two reviewers extracted data using a standardized extraction form. Results: In total, 298 reports were identified. Descriptions of the baseline characteristics and outcomes per group were missing in 45 reports (15.1%) and 48 reports (16.1%), respectively. More than half of the articles (n = 171, 57.4%) reported that a planned global test comparison was used (that is, assessment of the global differences between all groups), but 67 (39.2%) of these 171 articles did not report details of the planned analysis. Of the 116 articles reporting a global comparison test, 12 (10.3%) did not report the analysis as planned. In all, 60% of publications (n = 180) described planned pairwise test comparisons (that is, assessment of the difference between two groups), but 20 of these 180 articles (11.1%) did not report the pairwise test comparisons. Of the 204 articles reporting pairwise test comparisons, the comparisons were not planned for 44 (21.6%) of them. Less than half the reports (n = 137; 46%) provided baseline and outcome data per arm and reported the analysis as planned. Conclusions: Our findings highlight discrepancies between the planning and reporting of analyses in reports of multiple-arm trials.Statistics, Health sciencespr2341EpidemiologyArticlesExpressionPlot: a web-based framework for analysis of RNA-Seq and microarray gene expression data
http://academiccommons.columbia.edu/catalog/ac:183139
Friedman, Brad; Maniatis, Tomhttp://dx.doi.org/10.7916/D82J6979Mon, 08 Sep 2014 00:00:00 +0000RNA-Seq and microarray platforms have emerged as important tools for detecting changes in gene expression and RNA processing in biological samples. We present ExpressionPlot, a software package consisting of a default back end, which prepares raw sequencing or Affymetrix microarray data, and a web-based front end, which offers a biologically centered interface to browse, visualize, and compare different data sets. Download and installation instructions, a user's manual, discussion group, and a prototype are available at http://expressionplot.comStatistics, Bioinformaticstm2472Biochemistry and Molecular BiophysicsArticlesHydroclimatology of Extreme Precipitation and Floods Originating from the North Atlantic Ocean
http://academiccommons.columbia.edu/catalog/ac:177151
Nakamura, Jennifer Annehttp://dx.doi.org/10.7916/D86H4FM1Fri, 15 Aug 2014 00:00:00 +0000This study explores seasonal patterns and structures of moisture transport pathways from the North Atlantic Ocean and the Gulf of Mexico that lead to extreme large-scale precipitation and floods over land. Storm tracks, such as the tropical cyclone tracks in the Northern Atlantic Ocean, are an example of moisture transport pathways. In the first part, North Atlantic cyclone tracks are clustered by their moments to identify common traits in genesis locations, track shapes, intensities, life spans, landfalls, seasonal patterns, and trends. The clustering results of part one show the differences in dynamical behavior of tropical cyclones born in different parts of the basin. Drawing on these conclusions, in the second part, a statistical track segment model is developed for the simulation of tracks to improve the reliability of tropical cyclone risk probabilities. Moisture transport pathways from the North Atlantic Ocean are also explored through the specific regional flood dynamics of the U.S. Midwest and the United Kingdom in part three of the dissertation. Part I. Classifying North Atlantic Tropical Cyclones Tracks by Mass Moments. A new method for classifying tropical cyclones or similar features is introduced. The cyclone track is considered as an open spatial curve, with the wind speed or power information along the curve considered as a mass attribute. The first and second moments of the resulting object are computed and then used to classify the historical tracks using standard clustering algorithms. Mass moments allow the whole track shape, length and location to be incorporated into the clustering methodology. Tropical cyclones in the North Atlantic basin are clustered with K-means by mass moments, producing an optimum of six clusters with differing genesis locations, track shapes, intensities, life spans, landfalls, seasonality, and trends.
Even variables that are not directly clustered show distinct separation between clusters. A trend analysis confirms recent conclusions of increasing tropical cyclones in the basin over the past two decades. However, the trends vary across clusters. Part II: Tropical cyclone Intensity and Track Simulator (HITS) with Atlantic Ocean Applications for Risk Assessment. A nonparametric stochastic model is developed and tested for the simulation of tropical cyclone tracks. Tropical cyclone tracks demonstrate continuity and memory over many time and space steps. Clusters of tracks can be coherent, and the separation between clusters may be marked by geographical locations where groups of tracks diverge due to the physics of the underlying process. Consequently, their evolution may be non-Markovian. Markovian simulation models, as often used, may produce tracks that diverge or lose memory more quickly than tracks in nature. This is addressed here through a model that simulates tracks by randomly sampling track segments of varying length, selected from historical tracks. For performance evaluation, a spatial grid is imposed on the domain of interest. For each grid box, long-term tropical cyclone risk is assessed through the annual probability distributions of the number of storm hours, landfalls, winds, and other statistics. Total storm length is determined at birth by local distribution, and movement to other tropical cyclone segments by distance to neighbor tracks, comparative vector, and age of track. An assessment of the performance for tropical cyclone track simulation and potential directions for the improvement and use of such a model are discussed. Part III: Dynamical Structure of Extreme Floods in the U.S. Midwest and the United Kingdom. Twenty extreme spring floods that occurred in the Ohio Basin between 1901 and 2008, identified from daily river discharge data, are investigated and compared to the April 2011 Ohio River flood event.
Composites of synoptic fields for the flood events show that all these floods are associated with a similar pattern of sustained advection of low-level moisture and warm air from the tropical Atlantic Ocean and the Gulf of Mexico. The typical flow conditions are governed by an anomalous semi-stationary ridge situated east of the US East Coast, which steers the moisture and converges it into the Ohio Valley. Significantly, the moisture path common to all the 20 cases studied here as well as the case of April 2011 is distinctly different from the normal path of Atlantic moisture during spring, which occurs further west. It is shown further that the Ohio basin moisture convergence responsible for the floods is caused primarily by the atmospheric circulation anomaly advecting the climatological mean moisture field. Transport and related convergence due to the covariance between moisture anomalies and circulation anomalies are of secondary but non-negligible importance. The importance of atmospheric circulation anomalies to floods is confirmed by conducting a similar analysis for a series of winter floods on the River Eden in northwest England.Atmospheric sciences, Hydrologic sciences, Statisticsjam148Earth and Environmental EngineeringDissertationsLimit Theory for Spatial Processes, Bootstrap Quantile Variance Estimators, and Efficiency Measures for Markov Chain Monte Carlo
http://academiccommons.columbia.edu/catalog/ac:188852
Yang, Xuanhttp://dx.doi.org/10.7916/D84X560ZThu, 07 Aug 2014 00:00:00 +0000This thesis contains three topics: (I) limit theory for spatial processes, (II) asymptotic results on the bootstrap quantile variance estimator for importance sampling, and (III) an efficiency measure of MCMC. (I) First, central limit theorems are obtained for sums of observations from a $\kappa$-weakly dependent random field. In particular, the case is considered in which the observations are made from a random field at irregularly spaced and possibly random locations. The sums of these samples as well as sums of functions of pairs of the observations are objects of interest; the latter has applications in covariance estimation, composite likelihood estimation, etc. Moreover, examples of $\kappa$-weakly dependent random fields are explored and a method for the evaluation of $\kappa$-coefficients is presented. Next, statistical inference is considered for the stochastic heteroscedastic processes (SHP), which generalize the stochastic volatility time series model to space. A composite likelihood approach is adopted for parameter estimation, where the composite likelihood function is formed by a weighted sum of pairwise log-likelihood functions. In addition, the observation sites are assumed to be distributed according to a spatial point process. Sufficient conditions are provided for the maximum composite likelihood estimator to be consistent and asymptotically normal. (II) It is often difficult to provide an accurate estimate of the variance of the weighted sample quantile. Its asymptotic approximation requires the value of the density function, which may be hard to evaluate in complex systems. To circumvent this problem, the bootstrap estimator is considered. Theoretical results are established for the exact convergence rate and asymptotic distributions of the bootstrap variance estimators for quantiles of weighted empirical distributions.
Under regularity conditions, it is shown that the bootstrap variance estimator is asymptotically normal and has relative standard deviation of order $O(n^{-1/4})$. (III) A new performance measure is proposed to evaluate the efficiency of Markov chain Monte Carlo (MCMC) algorithms. More precisely, the large deviations rate of the probability that the Monte Carlo estimator deviates from the true value by a certain distance is used as a measure of efficiency of a particular MCMC algorithm. Numerical methods are proposed for the computation of the rate function based on samples of the renewal cycles of the Markov chain. Furthermore, the efficiency measure is applied to an array of MCMC schemes to determine their optimal tuning parameters.Statisticsxy2139StatisticsDissertationsConvex Optimization Algorithms and Recovery Theories for Sparse Models in Machine Learning
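Illustration for topic (II) of the thesis abstract above: a minimal sketch of a bootstrap variance estimate for a weighted sample quantile. It assumes the simplest scheme, resampling (value, weight) pairs with replacement and taking the sample variance of the recomputed quantiles. The function names, the quantile convention, and the default of 500 replicates are illustrative assumptions, not taken from the thesis.

```python
import random


def weighted_quantile(xs, ws, p):
    """Smallest x whose cumulative weight reaches a fraction p of the total."""
    pairs = sorted(zip(xs, ws))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for x, w in pairs:
        cum += w
        if cum >= p * total:
            return x
    return pairs[-1][0]  # guard against floating-point shortfall


def bootstrap_quantile_variance(xs, ws, p, n_boot=500, seed=0):
    """Bootstrap estimate of the variance of the weighted p-quantile."""
    rng = random.Random(seed)
    n = len(xs)
    reps = []
    for _ in range(n_boot):
        # Resample indices so each value keeps its own (importance) weight.
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(weighted_quantile([xs[i] for i in idx],
                                      [ws[i] for i in idx], p))
    mean = sum(reps) / n_boot
    return sum((r - mean) ** 2 for r in reps) / (n_boot - 1)


var_hat = bootstrap_quantile_variance(list(range(50)), [1.0] * 50, 0.5)
```

The appeal of this approach, as the abstract notes, is that it avoids evaluating the density at the quantile, which the asymptotic variance formula would require.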
http://academiccommons.columbia.edu/catalog/ac:175385
Huang, Bohttp://dx.doi.org/10.7916/D8VM49DMMon, 07 Jul 2014 00:00:00 +0000Sparse modeling is a rapidly developing topic that arises frequently in areas such as machine learning, data analysis and signal processing. One important application of sparse modeling is the recovery of a high-dimensional object from a relatively small number of noisy observations, which is the main focus of Compressed Sensing, Matrix Completion (MC) and Robust Principal Component Analysis (RPCA). However, the power of sparse models is hampered by the unprecedented size of the data that has become more and more available in practice. Therefore, it has become increasingly important to better harness convex optimization techniques to take advantage of any underlying "sparsity" structure in problems of extremely large size. This thesis focuses on two main aspects of sparse modeling. From the modeling perspective, it extends convex programming formulations for matrix completion and robust principal component analysis problems to the case of tensors, and derives theoretical guarantees for exact tensor recovery under a framework of strongly convex programming. On the optimization side, an efficient first-order algorithm with the optimal convergence rate is proposed and studied for a wide range of linearly constrained sparse modeling problems.Mathematics, Statistics, Operations researchIndustrial Engineering and Operations ResearchDissertationsEstimating the Q-matrix for Cognitive Diagnosis Models in a Bayesian Framework
http://academiccommons.columbia.edu/catalog/ac:176107
Chung, Meng-tahttp://dx.doi.org/10.7916/D857195BMon, 07 Jul 2014 00:00:00 +0000This research aims to develop an MCMC algorithm for estimating the Q-matrix in a Bayesian framework. A saturated multinomial model was used to estimate correlated attributes in the DINA model and rRUM. Closed-forms of posteriors for guess and slip parameters were derived for the DINA model. The random walk Metropolis-Hastings algorithm was applied to parameter estimation in the rRUM. An algorithm for reducing potential label switching was incorporated into the estimation procedure. A method for simulating data with correlated attributes for the DINA model and rRUM was offered. Three simulation studies were conducted to evaluate the algorithm for Bayesian estimation. Twenty simulated data sets for simulation study 1 were generated from independent attributes for the DINA model and rRUM. A hundred data sets from correlated attributes were generated for the DINA and rRUM with guess and slip parameters set to 0.2 in simulation study 2. Simulation study 3 analyzed data sets simulated from the DINA model with guess and slip parameters generated from Uniform (0.1, 0.4). Results from simulation studies showed that the Q-matrix recovery rate was satisfactory. Using the fraction-subtraction data, an empirical study was conducted for the DINA model and rRUM. The estimated Q-matrices from the two models were compared with the expert-designed Q-matrix.Quantitative psychology and psychometrics, Statistics, Educational tests and measurementsHuman Development, Measurement and EvaluationDissertationsUnbiased Penetrance Estimates with Unknown Ascertainment Strategies
http://academiccommons.columbia.edu/catalog/ac:175879
Gore, Kristenhttp://dx.doi.org/10.7916/D8KP8098Mon, 07 Jul 2014 00:00:00 +0000Allelic variation in the genome leads to variation in individuals' production of proteins. This, in turn, leads to variation in traits and development, and, in some cases, to diseases. Understanding the genetic basis for disease can aid in the search for therapies and in guiding genetic counseling. Thus, it is of interest to discover the genes with mutations responsible for diseases and to understand the impact of allelic variation at those genes. A subject's genetic composition is commonly referred to as the subject's genotype. Subjects who carry the gene mutation of interest are referred to as carriers. Subjects who are afflicted with a disease under study (that is, subjects who exhibit the phenotype) are termed affected carriers. The age-specific probability that a given subject will exhibit a phenotype of interest, given mutation status at a gene, is known as penetrance. Understanding penetrance is an important facet of genetic epidemiology. Penetrance estimates are typically calculated via maximum likelihood from family data. However, penetrance estimates can be biased if the nature of the sampling strategy is not correctly reflected in the likelihood. Unfortunately, sampling of family data may be conducted in a haphazard fashion or, even if conducted systematically, might be reported in an incomplete fashion. Bias is possible in applying likelihood methods to reported data if (as is commonly the case) some unaffected family members are not represented in the reports. The purpose here is to present an approach to find efficient and unbiased penetrance estimates in cases where there is incomplete knowledge of the sampling strategy and incomplete information on the full pedigree structure of families included in the data.
The method may be applied with different conjectural assumptions about the ascertainment strategy to balance the possibly biasing effects of wishful assumptions about the sampling strategy with the efficiency gains that could be obtained through valid assumptions.StatisticsStatisticsDissertationsStatistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
http://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephaniehttp://dx.doi.org/10.7916/D8TQ5ZP5Mon, 07 Jul 2014 00:00:00 +0000There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis examines several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models. We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is a prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We first examine the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings. Depending on the area of application and the researcher's degree of control over the experiment design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models.
In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.Statistics, Educational tests and measurements, Quantitative psychology and psychometricsStatisticsDissertationsPopulation Genetics of Identity By Descent
http://academiccommons.columbia.edu/catalog/ac:175990
Palamara, Pier Francescohttp://dx.doi.org/10.7916/D8V122XTMon, 07 Jul 2014 00:00:00 +0000Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. 
Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.Genetics, Computer science, Statisticspp2314Computer ScienceDissertationsA Characterization of Markov Equivalence Classes for Acyclic Digraphs
http://academiccommons.columbia.edu/catalog/ac:173896
Andersson, Steen A.; Madigan, David B.; Perlman, Michael D.http://dx.doi.org/10.7916/D8FX77J3Thu, 15 May 2014 00:00:00 +0000Undirected graphs and acyclic digraphs (ADG's), as well as their mutual extension to chain graphs, are widely used to describe dependencies among variables in multivariate distributions. In particular, the likelihood functions of ADG models admit convenient recursive factorizations that often allow explicit maximum likelihood estimates and that are well suited to building Bayesian networks for expert systems. Whereas the undirected graph associated with a dependence model is uniquely determined, there may be many ADG's that determine the same dependence (i.e., Markov) model. Thus, the family of all ADG's with a given set of vertices is naturally partitioned into Markov-equivalence classes, each class being associated with a unique statistical model. Statistical procedures, such as model selection or model averaging, that fail to take into account these equivalence classes may incur substantial computational or other inefficiencies. Here it is shown that each Markov-equivalence class is uniquely determined by a single chain graph, the essential graph, that is itself simultaneously Markov equivalent to all ADG's in the equivalence class. Essential graphs are characterized, a polynomial-time algorithm for their construction is given, and their applications to model selection and other statistical questions are described.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticlesLearning Theory Analysis for Association Rules and Sequential Event Prediction
http://academiccommons.columbia.edu/catalog/ac:173905
Rudin, Cynthia; Letham, Benjamin; Madigan, David B.http://dx.doi.org/10.7916/D82N50C1Thu, 15 May 2014 00:00:00 +0000We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer's online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support.
The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.Statistics, Artificial intelligencedm2418StatisticsArticlesAnalysis of Variance of Cross-Validation Estimators of the Generalization Error
http://academiccommons.columbia.edu/catalog/ac:173902
Markatou, Marianthi; Tian, Hong; Biswas, Shameek; Hripcsak, George M.http://dx.doi.org/10.7916/D86D5R2XThu, 15 May 2014 00:00:00 +0000This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables $Y=\mathrm{Card}(S_j \cap S_{j'})$ and $Y^*=\mathrm{Card}(S_j^c \cap S_{j'}^c)$, where $S_j$, $S_{j'}$ are two training sets, and $S_j^c$, $S_{j'}^c$ are the corresponding test sets. We prove that the distributions of $Y$ and $Y^*$ are hypergeometric and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results to the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.Statistics, Artificial intelligencemm168, ht2031, spb2003, gh13Statistics, Biomedical Informatics, BiostatisticsArticlesAlgorithms for Sparse Linear Classifiers in the Massive Data Setting
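Illustration for the cross-validation abstract above: the claim that the overlap of two training sets is hypergeometric can be checked numerically. The sketch assumes that each training set is drawn uniformly at random, without replacement and independently of the other, from n observations. The sizes n = 20 and n1 = 12, the function names, and the number of trials are illustrative choices only.

```python
import math
import random


def overlap_pmf(n, n1, k):
    """P(Y = k) for Y = size of the overlap of two random training sets.

    Hypergeometric: the first set marks n1 of the n points, and the
    second set draws n1 points without replacement.
    """
    return math.comb(n1, k) * math.comb(n - n1, n1 - k) / math.comb(n, n1)


def simulate_mean_overlap(n, n1, trials=5000, seed=0):
    """Average overlap of two independently sampled training sets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s_j = set(rng.sample(range(n), n1))
        s_j_prime = set(rng.sample(range(n), n1))
        total += len(s_j & s_j_prime)
    return total / trials


n, n1 = 20, 12
exact_mean = sum(k * overlap_pmf(n, n1, k) for k in range(n1 + 1))
empirical_mean = simulate_mean_overlap(n, n1)
```

Under these assumptions the exact mean overlap is n1 * n1 / n, and the simulated mean should agree with it up to Monte Carlo error. The test-set overlap can be treated the same way with n - n1 in place of n1.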
http://academiccommons.columbia.edu/catalog/ac:173908
Balakrishnan, Suhrid; Bartlett, Peter; Madigan, David B.http://dx.doi.org/10.7916/D8Z0368XThu, 15 May 2014 00:00:00 +0000Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.Statistics, Artificial intelligencedm2418StatisticsArticlesMedication-Wide Association Studies
http://academiccommons.columbia.edu/catalog/ac:173912
Ryan, P. B.; Stang, P. E.; Madigan, David B.; Schuemie, M. J.; Hripcsak, George M.http://dx.doi.org/10.7916/D8PG1PVXThu, 15 May 2014 00:00:00 +0000Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges in confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects.Pharmacology, Statistics, Bioinformaticsdm2418, gh13Statistics, Biomedical InformaticsArticlesA One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
http://academiccommons.columbia.edu/catalog/ac:173899
Balakrishnan, Suhrid; Madigan, David B.http://dx.doi.org/10.7916/D8B56GTPThu, 15 May 2014 00:00:00 +0000For Bayesian analysis of massive data, Markov chain Monte Carlo (MCMC) techniques often prove infeasible due to computational resource constraints. Standard MCMC methods generally require a complete scan of the dataset for each iteration. Ridgeway and Madigan (2002) and Chopin (2002b) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a small portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this paper, we present "1PFS," an efficient, one-pass algorithm. The algorithm employs a simple modification of the Ridgeway and Madigan (2002) particle filtering algorithm that replaces the MCMC based "rejuvenation" step with a more efficient "shrinkage" kernel smoothing based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway and Madigan (2002), namely a mixture model for Markov chains and Bayesian logistic regression. Our results indicate the proposed scheme delivers accurate parameter estimates while employing only a single pass through the data.Mathematics, Statisticsdm2418StatisticsArticlesBook Reviews: Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth.
http://academiccommons.columbia.edu/catalog/ac:173915
Madigan, David B.http://dx.doi.org/10.7916/D8DZ06D8Thu, 15 May 2014 00:00:00 +0000"Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth. MIT Press, Cambridge, MA, 2001. $50.00. xxxii+546 pp., hardcover. ISBN 0-262-08290-X. Is data mining the same as statistics? The distinguished authors of Principles of Data Mining struggle to make a distinction between the two subjects. In the end, what they have written is a fine applied statistics text." -- page 501Statisticsdm2418StatisticsReviewsCorrection: Separation and completeness properties for AMP chain graph Markov models
http://academiccommons.columbia.edu/catalog/ac:173887
Madigan, David B.; Levitz, Michael; Perlman, Michael D.http://dx.doi.org/10.7916/D8QF8R05Wed, 14 May 2014 00:00:00 +0000Correction of table 2 on page 1757 of 'Separation and completeness properties for AMP chain graph Markov models', Annals of Statistics, volume 29 (2001).Mathematics, Statisticsdm2418StatisticsArticlesBayesian Hierarchical Rule Modeling for Predicting Medical Conditions
http://academiccommons.columbia.edu/catalog/ac:173882
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D8V69GP1Wed, 14 May 2014 00:00:00 +0000We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future medical conditions given the patient’s current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “condition 1 and condition 2 → condition 3”) from a large set of candidate rules. Because this method “borrows strength” using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of conditions is available.Applied mathematics, Statistics, Medicinedm2418StatisticsArticlesSeparation and Completeness Properties for Amp Chain Graph Markov Models
http://academiccommons.columbia.edu/catalog/ac:173847
Levitz, Michael; Perlman, Michael D.; Madigan, David B.http://dx.doi.org/10.7916/D8X34VJGTue, 13 May 2014 00:00:00 +0000Pearl’s well-known d-separation criterion for an acyclic directed graph (ADG) is a pathwise separation criterion that can be used to efficiently identify all valid conditional independence relations in the Markov model determined by the graph. This paper introduces p-separation, a pathwise separation criterion that efficiently identifies all valid conditional independences under the Andersson–Madigan–Perlman (AMP) alternative Markov property for chain graphs (= adicyclic graphs), which include both ADGs and undirected graphs as special cases. The equivalence of p-separation to the augmentation criterion occurring in the AMP global Markov property is established, and p-separation is applied to prove completeness of the global Markov property for AMP chain graph models. Strong completeness of the AMP Markov property is established, that is, the existence of Markov perfect distributions that satisfy those and only those conditional independences implied by the AMP property (equivalently, by p-separation). A linear-time algorithm for determining p-separation is presented.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticlesBayesian Model Averaging: a Tutorial (with Comments by M. Clyde, David Draper and E. I. George, and a Rejoinder by the Authors)
http://academiccommons.columbia.edu/catalog/ac:173853
Hoeting, Jennifer A.; Madigan, David B.; Raftery, Adrian E.; Volinsky, Chris T.; Clyde, M.; Draper, David; George, E. I.http://dx.doi.org/10.7916/D84M92N7Tue, 13 May 2014 00:00:00 +0000Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.Statisticsdm2418StatisticsArticles[Bayesian Analysis in Expert Systems]: Comment: What's Next?
http://academiccommons.columbia.edu/catalog/ac:173856
Madigan, David B.http://dx.doi.org/10.7916/D8W37TFJTue, 13 May 2014 00:00:00 +0000"These papers represent two of the many different graphical modeling camps that have emerged from a flurry of activity in the past decade. The paper by Cox and Wermuth falls within the statistical graphical modeling camp and provides a useful generalization of that body of work. There is, of course, a price to be paid for this generality, namely that the interpretation of the graphs is more complex...The paper by Spiegelhalter, Dawid, Lauritzen and Cowell falls within the probabilistic expert system camp. This is a tour de force by researchers responsible for much of the astonishing progress in this area. Ten years ago, probabilistic models were shunned by the artificial intelligence community. That they are now widely accepted and used is due in large measure to the insights and efforts of these authors, along with other pioneers such as Judea Pearl and Peter Cheeseman..." -- page 261Mathematics, Statisticsdm2418StatisticsArticlesLocation Estimation in Wireless Networks: A Bayesian Approach
http://academiccommons.columbia.edu/catalog/ac:173820
Madigan, David B.; Ju, Wen-Hua; Krishnan, P.; Krishnakumar, A. S.; Zorych, Ivanhttp://dx.doi.org/10.7916/D82V2D74Tue, 13 May 2014 00:00:00 +0000We present a Bayesian hierarchical model for indoor location estimation in wireless networks. We demonstrate that our model achieves accuracy that is similar to other published models and algorithms. By harnessing prior knowledge, our model drastically reduces the requirement for training data as compared with existing approaches.Mathematics, Statistics, Applied mathematicsdm2418StatisticsArticlesA Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization
http://academiccommons.columbia.edu/catalog/ac:173817
Eyheramendy, Susana; Madigan, David B.http://dx.doi.org/10.7916/D86M34ZFTue, 13 May 2014 00:00:00 +0000We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method in text classification and in simulated data and show that our method outperforms the logistic and probit models and also the elastic net, in general by a substantial margin.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsBook chaptersA Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction
http://academiccommons.columbia.edu/catalog/ac:173838
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D89C6VJDTue, 13 May 2014 00:00:00 +0000In many healthcare settings, patients visit healthcare professionals periodically and report multiple medical conditions, or symptoms, at each encounter. We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future symptoms given the patient’s current and past history of reported symptoms. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “symptom 1 and symptom 2 → symptom 3”) from a large set of candidate rules. Because this method “borrows strength” using the symptoms of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of symptoms is available.Mathematics, Statistics, Medicinedm2418StatisticsArticlesGenerating Productive Dialogue between Consulting Statisticians and their Clients in the Pharmaceutical and Medical Research Settings
http://academiccommons.columbia.edu/catalog/ac:173832
Emir, Birol; Amaratunga, Dhammika; Beltangady, Mohan; Cabrera, Javier; Freeman, Roy; Madigan, David B.; Nguyen, Ha H.; Whalen, Edward Patrickhttp://dx.doi.org/10.7916/D8PK0D8NTue, 13 May 2014 00:00:00 +0000Due to the ever-increasing complexity of scientific technologies and resulting data, consulting statisticians are becoming more involved in the design, conduct, and analysis of biomedical research. This requires extensive collaboration between the consulting statistician and nonstatisticians, such as researchers, clinicians, and corporate executives. Consequently, a successful consulting career is becoming ever more dependent on the statistician's ability to effectively communicate with nonstatisticians. This is especially true when more complex, nontraditional analytical methods are required. In this paper, we examine the collaboration between statisticians and nonstatisticians from three different professional perspectives. Integrating these perspectives, we discuss ways to help the consulting statistician generate productive dialogue with clients. Finally, we examine how universities can better prepare students for careers in statistical consulting by incorporating more communication-based elements into their curriculum and by offering students ample opportunities to collaborate with nonstatisticians. Overall, we designed this exercise to help the consulting statistician generate dialogue with clients that results in more productive collaborations and a more satisfying work experience.Statistics, Bioinformatics, Medicinebe2166, dm2418, hhn2108, ew2320StatisticsArticles[Least Angle Regression]: Discussion
http://academiccommons.columbia.edu/catalog/ac:173841
Madigan, David B.; Ridgeway, Greghttp://dx.doi.org/10.7916/D81V5C29Tue, 13 May 2014 00:00:00 +0000Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms [e.g., Osborne, Presnell and Turlach (2000)]. Efron, Hastie, Johnstone and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such this paper is an important contribution to statistical computing.Mathematics, Statisticsdm2418StatisticsArticlesA Note on Equivalence Classes of Directed Acyclic Independence Graphs
http://academiccommons.columbia.edu/catalog/ac:173826
Madigan, David B.http://dx.doi.org/10.7916/D8TB150CTue, 13 May 2014 00:00:00 +0000Directed acyclic independence graphs (DAIGs) play an important role in recent developments in probabilistic expert systems and influence diagrams (Chyu [1]). The purpose of this note is to show that DAIGs can usefully be grouped into equivalence classes where the members of a single class share identical Markov properties. These equivalence classes can be identified via a simple graphical criterion. This result is particularly relevant to model selection procedures for DAIGs (see, e.g., Cooper and Herskovits [2] and Madigan and Raftery [4]) because it reduces the problem of searching among possible orientations of a given graph to that of searching among the equivalence classes.Mathematics, Statisticsdm2418StatisticsArticlesFit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to-one matched case-control data
http://academiccommons.columbia.edu/catalog/ac:174087
Zhou, Hui; Wang, Shuang; Zheng, Tianhttp://dx.doi.org/10.7916/D8028PNJMon, 12 May 2014 00:00:00 +0000Fit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to-one matched case-control dataStatisticshz2240, sw2206, tz33Statistics, BiostatisticsComputer softwareUnderstanding the Nature of Stellar Chemical Abundance Distributions in Nearby Stellar Systems
http://academiccommons.columbia.edu/catalog/ac:173510
Lee, Duane Morrishttp://dx.doi.org/10.7916/D84747X6Fri, 25 Apr 2014 00:00:00 +0000Since stars retain signatures of their galactic origins in their chemical compositions, we can exploit the chemical abundance distributions that we observe in stellar systems to put constraints on the nature of their progenitors. In this thesis, I present results from three projects aimed at understanding how high resolution spectroscopic observations of nearby stellar systems might be interpreted. The first project presents one possible explanation for the origin of peculiar abundance distributions observed in ultra-faint dwarf satellites of the Milky Way. The second project explores to what extent the distribution of chemical elements in the stellar halo can be used to trace Galactic accretion history from the birth of the Galaxy to the present day. Finally, a third project focuses on developing an input optimization algorithm for the second project to produce better estimates of halo accretion histories. In conclusion, I propose some other new ways to use statistical models and techniques along with chemical abundance distribution data to uncover galactic histories.Astronomy, Statistics, Nuclear chemistryAstronomyDissertationsInteraction-Based Learning for High-Dimensional Data with Continuous Predictors
http://academiccommons.columbia.edu/catalog/ac:196647
Huang, Chien-Hsunhttp://dx.doi.org/10.7916/D8X928CHMon, 07 Apr 2014 00:00:00 +0000High-dimensional data, such as that relating to gene expression in microarray experiments, may contain a substantial amount of useful information to be explored. However, the information, relevant variables and their joint interactions are usually diluted by noise due to a large number of non-informative variables. Consequently, variable selection plays a pivotal role for learning in high-dimensional problems. Traditional feature selection methods, such as Pearson's correlation between response and predictors, stepwise linear regression and LASSO, are among the popular linear methods. These methods are effective in identifying linear marginal effects but are limited in detecting non-linear or higher-order interaction effects. It is well known that epistasis (gene-gene interactions) may play an important role in gene expression where unknown functional forms are difficult to identify. In this thesis, we propose a novel nonparametric measure to first screen and do feature selection based on information from nearest neighborhoods. The method is inspired by Lo and Zheng's earlier work (2002) on detecting interactions for discrete predictors. We apply a backward elimination algorithm based on this measure which leads to the identification of many influential clusters of variables. Those identified groups of variables can capture both marginal and interactive effects. Second, each identified cluster has the potential to perform predictions and classifications more accurately. We also study procedures for combining these groups of individual classifiers to form a final predictor. Through simulation and real data analysis, the proposed measure is shown to be capable of identifying important variable sets and patterns including higher-order interaction sets. The proposed procedure outperforms existing methods in three different microarray datasets.
Moreover, the nonparametric measure is quite flexible and can be easily extended and applied to other areas of high-dimensional data and studies.Statistics, Machine learning--Statistical methods, Epistasis (Genetics), Instrumental variables (Statistics), Nonparametric statistics, Cluster analysisch2526StatisticsDissertationsA Point Process Model for the Dynamics of Limit Order Books
http://academiccommons.columbia.edu/catalog/ac:171221
Vinkovskaya, Ekaterinahttp://dx.doi.org/10.7916/D88913WWFri, 28 Feb 2014 00:00:00 +0000This thesis focuses on the statistical modeling of the dynamics of limit order books in electronic equity markets. The statistical properties of events affecting a limit order book (market orders, limit orders and cancellations) reveal strong evidence of clustering in time, cross-correlation across event types and dependence of the order flow on the bid-ask spread. Further investigation reveals the presence of a self-exciting property: a large number of events in a given time period tends to imply a higher probability of observing a large number of events in the following time period. We show that these properties may be adequately represented by a multivariate self-exciting point process with multiple regimes that reflect changes in the bid-ask spread. We propose a tractable parametrization of the model and perform a Maximum Likelihood Estimation of the model using high-frequency data from the Trades and Quotes database for US stocks. We show that the model may be used to obtain predictions of order flow and that its predictive performance beats the Poisson model as well as Moving Average and Auto Regressive time series models.StatisticsStatisticsDissertationsMixed Methods for Mixed Models
http://academiccommons.columbia.edu/catalog/ac:169644
Dorie, Vincent J.http://dx.doi.org/10.7916/D8V40S5XWed, 22 Jan 2014 00:00:00 +0000This work bridges the frequentist and Bayesian approaches to mixed models by borrowing the best features from both camps: point estimation procedures are combined with priors to obtain accurate, fast inference while posterior simulation techniques are developed that approximate the likelihood with great precision for the purposes of assessing uncertainty. These allow flexible inferences without the need to rely on expensive Markov chain Monte Carlo simulation techniques. Default priors are developed and evaluated in a variety of simulation and real-world settings with the end result that we propose a new set of standard approaches that yield superior performance at little computational cost.StatisticsStatisticsDissertationsMathematical Representations of Development Theories
http://academiccommons.columbia.edu/catalog/ac:168029
Singer, Burton; Spilerman, Seymour; Nesselroade, John R.; Boltes, Paul B.http://dx.doi.org/10.7916/D8NP22DSFri, 06 Dec 2013 00:00:00 +0000In this chapter we explore the consequences of particular stage linkage structures for the evolution of a population. We first argue the importance of constructing dynamic models of development theories and show the implications of various stage connections for population movements. A second focus concerns inverse problems: How the stage linkage structure may be recovered from survey data of the kind collected by developmental psychologists.Developmental psychology, Statisticsss50SociologyBook chaptersLearning to Believe in Sunspots
http://academiccommons.columbia.edu/catalog/ac:167710
Woodford, Michaelhttp://dx.doi.org/10.7916/D85X26VBMon, 25 Nov 2013 00:00:00 +0000An adaptive learning rule is exhibited for the Azariadis (1981) overlapping generations model of a monetary economy with multiple equilibria, under which the economy may converge to a stationary sunspot equilibrium, even if agents do not initially believe that outcomes are significantly different in different "sunspot" states. The type of learning rule studied is of the "stochastic approximation" form studied by Robbins and Monro (1951); methods for analyzing the convergence of this form of algorithm are presented that may be of use in many other contexts as well. Conditions are given under which convergence to a sunspot equilibrium occurs with probability one.Economics, Economic theory, Statisticsmw2230EconomicsArticlesSample Palomar Transient Factory light curves
http://academiccommons.columbia.edu/catalog/ac:167874
Price-Whelan, Adrian Michael; Agüeros, Marcel Andre; Fournier, Amanda P.; Street, Rachel; Ofek, Eran O.; Covey, Kevin R.; Levitan, David; Laher, Russ R.; Sesar, Branimir; Surace, Jasonhttp://dx.doi.org/10.7916/D8CF9N1NMon, 25 Nov 2013 00:00:00 +0000These light curves are made available to the public as part of the publication of our recent paper, "Statistical Searches for Microlensing Events in Large, Non-Uniformly Sampled Time-Domain Surveys: A Test Using Palomar Transient Factory Data." We have selected ~10,000 light curves from the Palomar Transient Factory database that can be used to test the various statistical tools described in the paper.Astronomy, Statisticsamp2217, maa17AstronomyDatasetsProspect Theory as Efficient Perceptual Distortion
http://academiccommons.columbia.edu/catalog/ac:167407
Woodford, Michaelhttp://dx.doi.org/10.7916/D8T43R03Thu, 21 Nov 2013 00:00:00 +0000The paper proposes a theory of efficient perceptual distortions, in which the statistical relation between subjective perceptions and the objective state minimizes the error of the state estimate, subject to a constraint on information processing capacity. The theory is shown to account for observed limits to the accuracy of visual perception, and then postulated to apply to perception of options in economic choice situations as well. When applied to choice between lotteries, it implies reference-dependent valuations, and predicts both risk-aversion with respect to gains and risk-seeking with respect to losses, as in the prospect theory of Kahneman and Tversky (1979).Statistics, Economic theory, Sociologymw2230EconomicsArticlesTwo Papers of Financial Engineering Relating to the Risk of the 2007-2008 Financial Crisis
http://academiccommons.columbia.edu/catalog/ac:167143
Zhong, Haowenhttp://dx.doi.org/10.7916/D8CC0XMGFri, 15 Nov 2013 00:00:00 +0000This dissertation studies two financial engineering and econometrics problems relating to two facets of the 2007-2008 financial crisis. In the first part, we construct the Spatial Capital Asset Pricing Model and the Spatial Arbitrage Pricing Theory to characterize the risk premiums of futures contracts on real estate assets. We also provide rigorous econometric analysis of the new models. Empirical study shows there exists significant spatial interaction among the S&P/Case-Shiller Home Price Index futures returns. In the second part, we perform empirical studies on the jump risk in the equity market. We propose a simple affine jump-diffusion model for equity returns, which seems to outperform existing ones (including models with Levy jumps) during the financial crisis and is at least as good during normal times, if model complexity is taken into account. In comparing the models, we made two empirical findings: (i) jump intensity seems to increase significantly during the financial crisis, while on average there appears to be little change of jump sizes; (ii) finite number of large jumps in returns for any finite time horizon seem to fit the data well both before and after the crisis.Operations research, Statisticshz2193Industrial Engineering and Operations ResearchDissertationsKernel-based association measures
http://academiccommons.columbia.edu/catalog/ac:167034
Liu, Yinghttp://hdl.handle.net/10022/AC:P:22154Thu, 07 Nov 2013 00:00:00 +0000Measures of associations have been widely used for describing the statistical relationships between two sets of variables. Traditional association measures tend to focus on specialized settings (specific types of variables or association patterns). Based on an in-depth summary of existing measures, we propose a general framework for association measures unifying existing methods and novel extensions based on kernels, including practical solutions to computational challenges. The proposed framework provides improved feature selection and extensions to a variety of current classifiers. Specifically, we introduce association screening and variable selection via maximizing kernel-based association measures. We also develop a backward dropping procedure for feature selection when there are a large number of candidate variables. We evaluate our framework using a wide variety of both simulated and real data. In particular, we conduct independence tests and feature selection using kernel association measures on diversified association patterns of different dimensions and variable types. The results show the superiority of our methods to existing ones. We also apply our framework to four real-world problems, three from statistical genetics and one of gender prediction from handwriting. We demonstrate through these applications both the de novo construction of new kernels and the adaptation of existing kernels tailored to the data at hand, and how kernel-based measures of associations can be naturally applied to different data structures including functional input and output spaces. This shows that our framework can be applied to a wide range of real-world problems and work well in practice.Statistics, Computer scienceyl2802StatisticsDissertationsInference of functional neural connectivity and convergence acceleration methods
http://academiccommons.columbia.edu/catalog/ac:179409
Nikitchenko, Maxim V.http://hdl.handle.net/10022/AC:P:22052Thu, 31 Oct 2013 00:00:00 +0000The knowledge of the maps of neuronal interactions is key for system neuroscience, but at the moment we possess relatively little of it. The recent development of experimental methods which allow a simultaneous recording of the spiking activity, but not the intracellular voltage, of thousands of neurons gives us an opportunity to start filling that gap. In Chapter 2, I present a method for the inference of the parameters of the leaky integrate-and-fire (LIF) model featuring time-dependent currents and conductances based only on the extracellular recording of spiking in the network. The fitted parameters can describe the functional connections in the network, as well as the internal properties of the cells. The method can also be used to determine whether a single-compartment model of a neuron should include conductance- or current-based synapses, or their mixture. In addition, because the same mathematical model describes some of the flavors of the Drift Diffusion Model (DDM), popular in the studies of the decision-making process, the presented method can be readily used to fit their parameters. Making the proposed inference procedure -- based on the expectation-maximization (EM) algorithm -- accurate and robust necessitated the development of a new numerical adaptive-grid (AG) method for the forward-backward (FB) propagation of the probability density, which is required in the computation of the sufficient statistic in the EM algorithm. These topics are covered in Chapter 3. Another issue which had to be addressed in order to obtain a usable inference algorithm is the well-known slow convergence of the EM algorithm in the flat regions of the loglikelihood. Two complementary approaches to this issue are presented in this dissertation.
In Chapter 4, I present a new framework for the acceleration of convergence of iterative algorithms (not limited to the EM) which unifies all previously known methods and allows us to construct a new method demonstrating the best performance of them all. To make the computations even faster, I wrote a Matlab package which allows them to be done in parallel on several machines and clusters. As one can see, all the aforementioned projects sprouted from one "head" project on the inference of the LIF model parameters. At the end of the dissertation, I briefly describe a disconnected project which is devoted to the development of a flexible experimental setup (software and hardware) for behavioral experiments, with a specific application to a particular type of the virtual Morris water maze experiment (VMWM).Neurosciences, Statisticsmvn2104Statistics, Neurobiology and BehaviorDissertationsLow-rank graphical models and Bayesian inference in the statistical analysis of noisy neural data
http://academiccommons.columbia.edu/catalog/ac:166472
Smith, Carl Alexanderhttp://hdl.handle.net/10022/AC:P:21991Fri, 11 Oct 2013 00:00:00 +0000We develop new methods of Bayesian inference, largely in the context of analysis of neuroscience data. The work is broken into several parts. In the first part, we introduce a novel class of joint probability distributions in which exact inference is tractable. Previously it has been difficult to find general constructions for models in which efficient exact inference is possible, outside of certain classical cases. We identify a class of such models that are tractable owing to a certain "low-rank" structure in the potentials that couple neighboring variables. In the second part we develop methods to quantify and measure information loss in analysis of neuronal spike train data due to two types of noise, making use of the ideas developed in the first part. Information about neuronal identity or temporal resolution may be lost during spike detection and sorting, or precision of spike times may be corrupted by various effects. We quantify the information lost due to these effects for the relatively simple but sufficiently broad class of Markovian model neurons. We find that decoders that model the probability distribution of spike-neuron assignments significantly outperform decoders that use only the most likely spike assignments. We also apply the ideas of the low-rank models from the first section to defining a class of prior distributions over the space of stimuli (or other covariate) which, by conjugacy, preserve the tractability of inference. In the third part, we treat Bayesian methods for the estimation of sparse signals, with application to the locating of synapses in a dendritic tree. We develop a compartmentalized model of the dendritic tree. 
Building on previous work that applied and generalized ideas of least angle regression to obtain a fast Bayesian solution to the resulting estimation problem, we describe two other approaches to the same problem, one employing a horseshoe prior and the other using various spike-and-slab priors. In the last part, we revisit the low-rank models of the first section and apply them to the problem of inferring orientation selectivity maps from noisy observations of orientation preference. The relevant low-rank model exploits the self-conjugacy of the von Mises distribution on the circle. Because the orientation map model is loopy, we cannot do exact inference on the low-rank model by the forward backward algorithm, but block-wise Gibbs sampling by the forward backward algorithm speeds mixing. We explore another von Mises coupling potential Gibbs sampler that proves to effectively smooth noisily observed orientation maps.Statistics, Neurosciencescas2207Statistics, ChemistryDissertationsGeneralized Volatility-Stabilized Processes
http://academiccommons.columbia.edu/catalog/ac:165162
Pickova, Radkahttp://hdl.handle.net/10022/AC:P:21616Fri, 13 Sep 2013 00:00:00 +0000In this thesis, we consider systems of interacting diffusion processes which we call Generalized Volatility-Stabilized processes, as they extend the Volatility-Stabilized Market models introduced in Fernholz and Karatzas (2005). First, we show how to construct a weak solution of the underlying system of stochastic differential equations. In particular, we express the solution in terms of time-changed squared-Bessel processes and argue that this solution is unique in distribution. In addition, we discuss sufficient conditions under which this solution does not explode in finite time, and provide sufficient conditions for pathwise uniqueness and for existence of a strong solution. Secondly, we discuss the significance of these processes in the context of Stochastic Portfolio Theory. We describe specific market models which assume that the dynamics of the stocks' capitalizations is the same as that of the Generalized Volatility-Stabilized processes, and we argue that strong relative arbitrage opportunities may exist in these markets; specifically, we provide multiple examples of portfolios that outperform the market portfolio. Moreover, we examine the properties of market weights as well as the diversity-weighted portfolio in these models. Thirdly, we provide some asymptotic results for these processes which allow us to describe different properties of the corresponding market models based on these processes.Statisticsrp2424Statistics, MathematicsDissertationsThe Representation of Social Processes by Markov Models
http://academiccommons.columbia.edu/catalog/ac:165054
Singer, Burton; Spilerman, Seymourhttp://hdl.handle.net/10022/AC:P:21574Thu, 12 Sep 2013 00:00:00 +0000In this paper we consider a class of issues which are central to modeling social phenomena by continuous-time Markov structures. In particular, we discuss (a) embeddability, or how to determine whether observations on an empirical process could have arisen via the evolution of a continuous-time Markov structure; and (b) identification, or what to do if the observations are consistent with more than one continuous-time Markov structure. With respect to the latter topic, we discuss how to select the specific structure from the list of alternatives which should be associated with the empirical process. We point out that the issues of embeddability and identification are especially pertinent to modeling empirical processes when one has available only fragmentary data and when the observations contain "noise" or other sources of error. These characteristics, of course, describe the typical work situation of sociologists. Finally, we note the type of situation in which a continuous-time model is the proper structure to employ and indicate that issues analogous to the ones we describe here apply to modeling social processes with discrete-time structures.Sociology, Statisticsss50SociologyArticlesThe Cognitive and Demographic Variables that Underlie Notetaking and Review in Mathematics: Does Quality of Notes Predict Test Performance in Mathematics?
http://academiccommons.columbia.edu/catalog/ac:163324
Belanfante, Elizabeth Andreahttp://hdl.handle.net/10022/AC:P:21089Tue, 16 Jul 2013 00:00:00 +0000Taking and reviewing lecture notes is an effective and prevalent method of studying employed by students at the post-secondary level (Armbruster, 2000; Armbruster, 2009; Dunkel and Davy, 1989; Peverly et al., 2009). However, few studies have examined the cognitive variables that underlie this skill. In addition, these studies have focused on more verbally based domains, such as history and psychology. The current study examined the practical utility of notes in actual class settings. It is the first study that has attempted to examine the outcomes and cognitive skills associated with note-taking and review in any area of mathematics. It also set out to establish the importance of quality of notes and quality of review sheets to test performance in graduate level probability and statistics courses. Finally, this dissertation sought to explore the extent to which variables besides notes also contribute to test performance in this domain. Participants included 74 graduate students enrolled in introductory probability and statistics courses at a private graduate teacher education college in a large city in the Northeast United States. Participants took notes during class and provided the researcher with a copy of their notes for several lectures. Participants were also required to write down additional information on the back of two formula sheets that were used as an aid on the midterm exam. The independent variables included handwriting speed, gender, spatial visualization ability, background knowledge, verbal ability, and working memory. The dependent variables were quality of lecture notes, quality of supplemental review sheets, and midterm performance. All measures were group administered. Results revealed that gender was the only predictor of quality of lecture notes. Quality of lecture notes was the only significant predictor of quality of supplemental review sheets. 
Neither quality of lecture notes nor quality of supplemental review sheets predicted overall test performance. Instead, background knowledge and instructor significantly predicted overall test performance. Handwriting speed was a marginally significant predictor of overall test performance. Future research aimed at replicating these findings and expanding the results to include other mathematical domains and educational levels is recommended.Mathematics, Statistics, Educationeab2111Health and Behavior Studies, School PsychologyDissertationsApplication of ordered latent class regression model in educational assessment
http://academiccommons.columbia.edu/catalog/ac:161911
Cha, Jisunghttp://hdl.handle.net/10022/AC:P:20599Thu, 06 Jun 2013 00:00:00 +0000Latent class analysis is a useful tool to deal with discrete multivariate response data. Croon (1990) proposed the ordered latent class model where latent classes are ordered by imposing inequality constraints on the cumulative conditional response probabilities. Taking stochastic ordering of latent classes into account in the analysis of data gives a meaningful interpretation, since the primary purpose of a test is to order students on the latent trait continuum. This study extends Croon's model to ordered latent class regression that regresses latent class membership on covariates (e.g., gender, country) and demonstrates the utility of an ordered latent class regression model in educational assessment using data from the Trends in International Mathematics and Science Study (TIMSS). The benefit of this model is that item analysis and group comparisons can be done simultaneously in one model. The model is fitted by maximum likelihood estimation with an EM algorithm. It is found that the proposed model is a useful tool for exploratory purposes as a special case of nonparametric item response models, and that cross-country differences can be modeled as different compositions of discrete classes. Simulations are done to evaluate the performance of information criteria (AIC and BIC) in selecting the appropriate number of latent classes in the model. In the simulation results, AIC outperforms BIC for the model with the order-restricted maximum likelihood estimator.Educational tests and measurements, Statistics, Mathematics educationjc2320Human Development, Measurement and EvaluationDissertationsPenalized Joint Maximum Likelihood Estimation Applied to Two Parameter Logistic Item Response Models
http://academiccommons.columbia.edu/catalog/ac:161745
Paolino, Jon-Paul Noelhttp://hdl.handle.net/10022/AC:P:20531Fri, 31 May 2013 00:00:00 +0000Item response theory (IRT) models are a conventional tool for analyzing both small scale and large scale educational data sets, and they are also used for the development of high-stakes tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Exam (GRE). When estimating these models it is imperative that the data set includes many more examinees than items, a requirement similar to that in regression modeling, where many more observations than variables are needed. If this requirement has not been met, the analysis will yield meaningless results. Recently, penalized estimation methods have been developed to analyze data sets that may include more variables than observations. The main focus of this study was to apply LASSO and ridge regression penalization techniques to IRT models in order to better estimate model parameters. The results of our simulations showed that this new estimation procedure, called penalized joint maximum likelihood estimation, provided meaningful estimates when IRT data sets included more items than examinees, a setting in which traditional Bayesian estimation and marginal maximum likelihood methods were not appropriate. However, when the IRT data sets contained more examinees than items, Bayesian estimation clearly outperformed both penalized joint maximum likelihood estimation and marginal maximum likelihood.Statisticsjnp2111Human Development, Measurement and EvaluationDissertationsStochastic Models of Limit Order Markets
http://academiccommons.columbia.edu/catalog/ac:161685
Kukanov, Arseniyhttp://hdl.handle.net/10022/AC:P:20511Thu, 30 May 2013 00:00:00 +0000During the last two decades most stock and derivatives exchanges in the world transitioned to electronic trading in limit order books, creating a need for a new set of quantitative models to describe these order-driven markets. This dissertation offers a collection of models that provide insight into the structure of modern financial markets, and can help to optimize trading decisions in practical applications. In the first part of the thesis we study the dynamics of prices, order flows and liquidity in limit order markets over short timescales. We propose a stylized order book model that predicts a particularly simple linear relation between price changes and order flow imbalance, defined as a difference between net changes in supply and demand. The slope in this linear relation, called a price impact coefficient, is inversely proportional in our model to market depth, a measure of liquidity. Our empirical results confirm both of these predictions. The linear relation between order flow imbalance and price changes holds for time intervals between 50 milliseconds and 5 minutes. The inverse relation between the price impact coefficient and market depth holds on longer timescales. These findings shed new light on intraday variations in market volatility. According to our model, volatility fluctuates due to changes in market depth or in order flow variance. Previous studies also found a positive correlation between volatility and trading volume, but in order-driven markets prices are determined by the limit order book activity, so the association between trading volume and volatility is unclear. We show how a spurious correlation between these variables can indeed emerge in our linear model due to time aggregation of high-frequency data. 
Finally, we observe short-term positive autocorrelation in order flow imbalance and discuss an application of this variable as a measure of adverse selection in limit order executions. Our results suggest that monitoring recent order flow can improve the quality of order executions in practice. In the second part of the thesis we study the problem of optimal order placement in a fragmented limit order market. To execute a trade, market participants can submit limit orders or market orders across various exchanges where a stock is traded. In practice these decisions are influenced by sizes of order queues and by statistical properties of order flows in each limit order book, and also by rebates that exchanges pay for limit order submissions. We present a realistic model of limit order executions and formalize the search for an optimal order placement policy as a convex optimization problem. Based on this formulation we study how various factors determine an investor's order placement decisions. In the case when a single exchange is used for order execution, we derive an explicit formula for the optimal limit and market order quantities. Our solution shows that the optimal split between market and limit orders largely depends on one's tolerance for execution risk. Market orders help to alleviate this risk because they execute with certainty. Correspondingly, we find that an optimal order allocation shifts to these more expensive orders when the execution risk is of primary concern, for example when the intended trade quantity is large or when it is costly to catch up on the quantity after limit order execution fails. We also characterize the optimal solution in the general case of simultaneous order placement on multiple exchanges, and show that it sets execution shortfall probabilities to specific threshold values computed with model parameters. 
Finally, we propose a non-parametric stochastic algorithm that computes an optimal solution by resampling historical data and does not require specifying order flow distributions. A numerical implementation of this algorithm is used to study the sensitivity of an optimal solution to changes in model parameters. Our numerical results show that order placement optimization can bring a substantial reduction in trading costs, especially for small orders and in cases when order flows are relatively uncorrelated across trading venues. The order placement optimization framework developed in this thesis can also be used to quantify the costs and benefits of financial market fragmentation from the point of view of an individual investor. For instance, we find that a positive correlation between order flows, which is empirically observed in a fragmented U.S. equity market, increases the costs of trading. As the correlation increases it may become more expensive to trade in a fragmented market than it is in a consolidated market. In the third part of the thesis we analyze the dynamics of limit order queues at the best bid or ask of an exchange. These queues consist of orders submitted by a variety of market participants, yet existing order book models commonly assume that all orders have similar dynamics. In practice, some orders are submitted by trade execution algorithms in an attempt to buy or sell a certain quantity of assets under time constraints, and these orders are canceled if their realized waiting time exceeds a patience threshold. In contrast, high-frequency traders submit and cancel orders depending on the order book state and their orders are not driven by patience. The interaction between these two order types within a single FIFO queue leads to bursts of order cancelations for small queues and anomalously long waiting times in large queues. 
We analyze a fluid model that describes the evolution of large order queues in liquid markets, taking into account the heterogeneity between order submission and cancelation strategies of different traders. Our results show that after a finite initial time interval, the queue reaches a specific structure where all orders from high-frequency traders stay in the queue until execution but most orders from execution algorithms exceed their patience thresholds and are canceled. This "order crowding" effect has been previously noted by participants in highly liquid stock and futures markets and was attributed to a large participation of high-frequency traders. In our model, their presence creates an additional workload, which increases queue waiting times for new orders. Our analysis of the fluid model leads to waiting time estimates that take into account the distribution of order types in a queue. These estimates are tested against a large dataset of realized limit order waiting times collected by a U.S. equity brokerage firm. The queue composition at the moment of order submission noticeably affects an order's waiting time, and we find that assuming a single order type for all orders in the queue leads to unrealistic results. Estimates that instead assume a mix of heterogeneous orders in the queue are closer to empirical data. Our model for a limit order queue with heterogeneous order types also appears to be interesting from a methodological point of view. It introduces a new type of behavior in a queueing system where one class of jobs has state-dependent dynamics, while others are driven by patience. Although this model is motivated by the analysis of limit order books, it may find applications in studying other service systems with state-dependent abandonments.Operations research, Finance, Statisticsak2870Industrial Engineering and Operations ResearchDissertations