Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Statistics&q=&rows=500&sort=record_creation_date+desc
Statistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
http://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephaniehttp://dx.doi.org/10.7916/D8TQ5ZP5Mon, 07 Jul 2014 11:50:59 +0000There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis examines several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models.
We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is a prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We examine the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings.
Depending on the area of application and the researcher's degree of control over the experiment design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models. In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.Statistics, Educational tests and measurements, Quantitative psychology and psychometricsStatisticsDissertationsUnbiased Penetrance Estimates with Unknown Ascertainment Strategies
http://academiccommons.columbia.edu/catalog/ac:175879
Gore, Kristenhttp://dx.doi.org/10.7916/D8KP8098Mon, 07 Jul 2014 11:39:52 +0000Allelic variation in the genome leads to variation in individuals' production of proteins. This, in turn, leads to variation in traits and development, and, in some cases, to diseases. Understanding the genetic basis for disease can aid in the search for therapies and in guiding genetic counseling. Thus, it is of interest to discover the genes with mutations responsible for diseases and to understand the impact of allelic variation at those genes.
A subject's genetic composition is commonly referred to as the subject's genotype. Subjects who carry the gene mutation of interest are referred to as carriers. Subjects who are afflicted with a disease under study (that is, subjects who exhibit the phenotype) are termed affected carriers. The age-specific probability that a given subject will exhibit a phenotype of interest, given mutation status at a gene, is known as penetrance.
Understanding penetrance is an important facet of genetic epidemiology. Penetrance estimates are typically calculated via maximum likelihood from family data. However, penetrance estimates can be biased if the nature of the sampling strategy is not correctly reflected in the likelihood. Unfortunately, sampling of family data may be conducted in a haphazard fashion or, even if conducted systematically, might be reported in an incomplete fashion. Bias is possible in applying likelihood methods to reported data if (as is commonly the case) some unaffected family members are not represented in the reports.
The purpose here is to present an approach to find efficient and unbiased penetrance estimates in cases where there is incomplete knowledge of the sampling strategy and incomplete information on the full pedigree structure of families included in the data. The method may be applied with different conjectural assumptions about the ascertainment strategy to balance the possibly biasing effects of wishful assumptions about the sampling strategy with the efficiency gains that could be obtained through valid assumptions.StatisticsStatisticsDissertationsToward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals
http://academiccommons.columbia.edu/catalog/ac:174140
Stodden, Victoria C.; Guo, Peixuan; Ma, Zhaokunhttp://dx.doi.org/10.7916/D80K26NNWed, 21 May 2014 11:58:15 +0000Journal policy on research data and code availability is an important part of the ongoing shift toward publishing reproducible computational science. This article extends the literature by studying journal data sharing policies by year (for both 2011 and 2012) for a referent set of 170 journals. We make a further contribution by evaluating code sharing policies, supplemental materials policies, and open access status for these 170 journals for each of 2011 and 2012. We build a predictive model of open data and code policy adoption as a function of impact factor and publisher and find higher impact journals more likely to have open data and code policies and scientific societies more likely to have open data and code policies than commercial publishers. We also find open data policies tend to lead open code policies, and we find no relationship between open data and code policies and either supplemental material policies or open access journal status. Of the journals in this study, 38% had a data policy, 22% had a code policy, and 66% had a supplemental materials policy as of June 2012. This reflects a striking one year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies. We introduce a new dataset to the community that categorizes data and code sharing, supplemental materials, and open access policies in 2011 and 2012 for these 170 journals.Technical communication, Information sciencevcs2115, zm2168StatisticsArticlesBook Reviews: Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth.
http://academiccommons.columbia.edu/catalog/ac:173915
Madigan, David B.http://dx.doi.org/10.7916/D8DZ06D8Thu, 15 May 2014 12:45:12 +0000"Principles of Data Mining. By David Hand, Heikki Mannila, and Padhraic Smyth. MIT Press, Cambridge, MA, 2001. $50.00. xxxii+546 pp., hardcover. ISBN 0-262-08290-X. Is data mining the same as statistics? The distinguished authors of Principles of Data Mining struggle to make a distinction between the two subjects. In the end, what they have written is a fine applied statistics text." -- page 501Statisticsdm2418StatisticsReviewsMedication-Wide Association Studies
http://academiccommons.columbia.edu/catalog/ac:173912
Ryan, P. B.; Stang, P. E.; Madigan, David B.; Schuemie, M. J.; Hripcsak, George M.http://dx.doi.org/10.7916/D8PG1PVXThu, 15 May 2014 12:30:39 +0000Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges in confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects.Pharmacology, Statistics, Bioinformaticsdm2418, gh13Statistics, Biomedical InformaticsArticlesAlgorithms for Sparse Linear Classifiers in the Massive Data Setting
http://academiccommons.columbia.edu/catalog/ac:173908
Balakrishnan, Suhrid; Bartlett, Peter; Madigan, David B.http://dx.doi.org/10.7916/D8Z0368XThu, 15 May 2014 12:25:33 +0000Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.Statistics, Artificial intelligencedm2418StatisticsArticlesLearning Theory Analysis for Association Rules and Sequential Event Prediction
http://academiccommons.columbia.edu/catalog/ac:173905
Rudin, Cynthia; Letham, Benjamin; Madigan, David B.http://dx.doi.org/10.7916/D82N50C1Thu, 15 May 2014 12:19:33 +0000We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer's online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support.
The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.Statistics, Artificial intelligencedm2418StatisticsArticlesAnalysis of Variance of Cross-Validation Estimators of the Generalization Error
http://academiccommons.columbia.edu/catalog/ac:173902
Markatou, Marianthi; Tian, Hong; Biswas, Shameek; Hripcsak, George M.http://dx.doi.org/10.7916/D86D5R2XThu, 15 May 2014 11:58:33 +0000This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables $Y=\mathrm{Card}(S_j \cap S_{j'})$ and $Y^*=\mathrm{Card}(S_j^c \cap S_{j'}^c)$, where $S_j$, $S_{j'}$ are two training sets, and $S_j^c$, $S_{j'}^c$ are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results in the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.Statistics, Artificial intelligencemm168, ht2031, spb2003, gh13Biostatistics, Biomedical Informatics, StatisticsArticlesA One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
http://academiccommons.columbia.edu/catalog/ac:173899
Balakrishnan, Suhrid; Madigan, David B.http://dx.doi.org/10.7916/D8B56GTPThu, 15 May 2014 11:51:51 +0000For Bayesian analysis of massive data, Markov chain Monte Carlo (MCMC) techniques often prove infeasible due to computational resource constraints. Standard MCMC methods generally require a complete scan of the dataset for each iteration. Ridgeway and Madigan (2002) and Chopin (2002b) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a small portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this paper, we present "1PFS," an efficient, one-pass algorithm. The algorithm employs a simple modification of the Ridgeway and Madigan (2002) particle filtering algorithm that replaces the MCMC based "rejuvenation" step with a more efficient "shrinkage" kernel smoothing based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway and Madigan (2002), namely a mixture model for Markov chains and Bayesian logistic regression. Our results indicate the proposed scheme delivers accurate parameter estimates while employing only a single pass through the data.Mathematics, Statisticsdm2418StatisticsArticlesA Characterization of Markov Equivalence Classes for Acyclic Digraphs
http://academiccommons.columbia.edu/catalog/ac:173896
Andersson, Steen A.; Madigan, David B.; Perlman, Michael D.http://dx.doi.org/10.7916/D8FX77J3Thu, 15 May 2014 11:28:36 +0000Undirected graphs and acyclic digraphs (ADG's), as well as their mutual extension to chain graphs, are widely used to describe dependencies among variables in multivariate distributions. In particular, the likelihood functions of ADG models admit convenient recursive factorizations that often allow explicit maximum likelihood estimates and that are well suited to building Bayesian networks for expert systems. Whereas the undirected graph associated with a dependence model is uniquely determined, there may be many ADG's that determine the same dependence (i.e., Markov) model. Thus, the family of all ADG's with a given set of vertices is naturally partitioned into Markov-equivalence classes, each class being associated with a unique statistical model. Statistical procedures, such as model selection or model averaging, that fail to take into account these equivalence classes may incur substantial computational or other inefficiencies. Here it is shown that each Markov-equivalence class is uniquely determined by a single chain graph, the essential graph, that is itself simultaneously Markov equivalent to all ADG's in the equivalence class. Essential graphs are characterized, a polynomial-time algorithm for their construction is given, and their applications to model selection and other statistical questions are described.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticlesCorrection: Separation and completeness properties for AMP chain graph Markov models
http://academiccommons.columbia.edu/catalog/ac:173887
Madigan, David B.; Levitz, Michael; Perlman, Michael D.http://dx.doi.org/10.7916/D8QF8R05Wed, 14 May 2014 19:42:28 +0000Correction of table 2 on page 1757 of 'Separation and completeness properties for AMP chain graph Markov models', Annals of Statistics, volume 29 (2001).Mathematics, Statisticsdm2418StatisticsArticlesBayesian Hierarchical Rule Modeling for Predicting Medical Conditions
http://academiccommons.columbia.edu/catalog/ac:173882
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D8V69GP1Wed, 14 May 2014 19:02:36 +0000We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future medical conditions given the patient’s current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “condition 1 and condition 2 → condition 3”) from a large set of candidate rules. Because this method “borrows strength” using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of conditions is available.Applied mathematics, Statistics, Medicinedm2418StatisticsArticles[Bayesian Analysis in Expert Systems]: Comment: What's Next?
http://academiccommons.columbia.edu/catalog/ac:173856
Madigan, David B.http://dx.doi.org/10.7916/D8W37TFJTue, 13 May 2014 17:59:40 +0000"These papers represent two of the many different graphical modeling camps that have emerged from a flurry of activity in the past decade. The paper by Cox and Wermuth falls within the statistical graphical modeling camp and provides a useful generalization of that body of work. There is, of course, a price to be paid for this generality, namely that the interpretation of the graphs is more complex...The paper by Spiegelhalter, Dawid, Lauritzen and Cowell falls within the probabilistic expert system camp. This is a tour de force by researchers responsible for much of the astonishing progress in this area. Ten years ago, probabilistic models were shunned by the artificial intelligence community. That they are now widely accepted and used is due in large measure to the insights and efforts of these authors, along with other pioneers such as Judea Pearl and Peter Cheeseman..." -- page 261Mathematics, Statisticsdm2418StatisticsArticlesBayesian Model Averaging: a Tutorial (with Comments by M. Clyde, David Draper and E. I. George, and a Rejoinder by the Authors)
http://academiccommons.columbia.edu/catalog/ac:173853
Hoeting, Jennifer A.; Madigan, David B.; Raftery, Adrian E.; Volinsky, Chris T.; Clyde, M.; Draper, David; George, E. I.http://dx.doi.org/10.7916/D84M92N7Tue, 13 May 2014 17:39:49 +0000Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.Statisticsdm2418StatisticsArticles[A Report on the Future of Statistics]: Comment
http://academiccommons.columbia.edu/catalog/ac:173850
Madigan, David B.; Stuetzle, Wernerhttp://dx.doi.org/10.7916/D8D50K3VTue, 13 May 2014 17:28:46 +0000"Extraordinary opportunities for statistical ideas and for statisticians now present themselves. However, to take advantage of the opportunities, statistics has to change the way in which it recruits and trains students. Statistics has primarily focused on squeezing the maximum amount of information out of limited data. This paradigm is rapidly diminishing in importance and statistics education finds itself out of step with reality. The problems begin at the high school and undergraduate levels, where the standard course includes a narrow set of pre-computing-era topics. At the graduate level, the typical statistics program suffers from the same problem..." -- page 408Mathematics education, Higher educationdm2418StatisticsArticlesSeparation and Completeness Properties for AMP Chain Graph Markov Models
http://academiccommons.columbia.edu/catalog/ac:173847
Levitz, Michael; Perlman, Michael D.; Madigan, David B.http://dx.doi.org/10.7916/D8X34VJGTue, 13 May 2014 16:30:46 +0000Pearl’s well-known d-separation criterion for an acyclic directed graph (ADG) is a pathwise separation criterion that can be used to efficiently identify all valid conditional independence relations in the Markov model determined by the graph. This paper introduces p-separation, a pathwise separation criterion that efficiently identifies all valid conditional independences under the Andersson–Madigan–Perlman (AMP) alternative Markov property for chain graphs (= adicyclic graphs), which include both ADGs and undirected graphs as special cases. The equivalence of p-separation to the augmentation criterion occurring in the AMP global Markov property is established, and p-separation is applied to prove completeness of the global Markov property for AMP chain graph models. Strong completeness of the AMP Markov property is established, that is, the existence of Markov perfect distributions that satisfy those and only those conditional independences implied by the AMP property (equivalently, by p-separation). A linear-time algorithm for determining p-separation is presented.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsArticles[Least Angle Regression]: Discussion
http://academiccommons.columbia.edu/catalog/ac:173841
Madigan, David B.; Ridgeway, Greghttp://dx.doi.org/10.7916/D81V5C29Tue, 13 May 2014 16:15:23 +0000Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms [e.g., Osborne, Presnell and Turlach (2000)]. Efron, Hastie, Johnstone and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such this paper is an important contribution to statistical computing.Mathematics, Statisticsdm2418StatisticsArticlesA Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction
http://academiccommons.columbia.edu/catalog/ac:173838
McCormick, Tyler H.; Rudin, Cynthia; Madigan, David B.http://dx.doi.org/10.7916/D89C6VJDTue, 13 May 2014 15:27:01 +0000In many healthcare settings, patients visit healthcare professionals periodically and report multiple medical conditions, or symptoms, at each encounter. We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient’s possible future symptoms given the patient’s current and past history of reported symptoms. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as “symptom 1 and symptom 2 → symptom 3”) from a large set of candidate rules. Because this method “borrows strength” using the symptoms of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient’s history of symptoms is available.Mathematics, Statistics, Medicinedm2418StatisticsArticlesGenerating Productive Dialogue between Consulting Statisticians and their Clients in the Pharmaceutical and Medical Research Settings
http://academiccommons.columbia.edu/catalog/ac:173832
Emir, Birol; Amaratunga, Dhammika; Beltangady, Mohan; Cabrera, Javier; Freeman, Roy; Madigan, David B.; Nguyen, Ha H.; Whalen, Edward Patrickhttp://dx.doi.org/10.7916/D8PK0D8NTue, 13 May 2014 15:09:40 +0000Due to the ever-increasing complexity of scientific technologies and resulting data, consulting statisticians are becoming more involved in the design, conduct, and analysis of biomedical research. This requires extensive collaboration between the consulting statistician and nonstatisticians, such as researchers, clinicians, and corporate executives. Consequently, a successful consulting career is becoming ever more dependent on the statistician's ability to effectively communicate with nonstatisticians. This is especially true when more complex, nontraditional analytical methods are required. In this paper, we examine the collaboration between statisticians and nonstatisticians from three different professional perspectives. Integrating these perspectives, we discuss ways to help the consulting statistician generate productive dialogue with clients. Finally, we examine how universities can better prepare students for careers in statistical consulting by incorporating more communication-based elements into their curriculum and by offering students ample opportunities to collaborate with nonstatisticians. Overall, we designed this exercise to help the consulting statistician generate dialogue with clients that results in more productive collaborations and a more satisfying work experience.Statistics, Bioinformatics, Medicinebe2166, dm2418, hhn2108, ew2320StatisticsArticlesA Note on Equivalence Classes of Directed Acyclic Independence Graphs
http://academiccommons.columbia.edu/catalog/ac:173826
Madigan, David B.http://dx.doi.org/10.7916/D8TB150CTue, 13 May 2014 14:46:04 +0000Directed acyclic independence graphs (DAIGs) play an important role in recent developments in probabilistic expert systems and influence diagrams (Chyu [1]). The purpose of this note is to show that DAIGs can usefully be grouped into equivalence classes where the members of a single class share identical Markov properties. These equivalence classes can be identified via a simple graphical criterion. This result is particularly relevant to model selection procedures for DAIGs (see, e.g., Cooper and Herskovits [2] and Madigan and Raftery [4]) because it reduces the problem of searching among possible orientations of a given graph to that of searching among the equivalence classes.Mathematics, Statisticsdm2418StatisticsArticlesLocation Estimation in Wireless Networks: A Bayesian Approach
http://academiccommons.columbia.edu/catalog/ac:173820
Madigan, David B.; Ju, Wen-Hua; Krishnan, P.; Krishnakumar, A. S.; Zorych, Ivanhttp://dx.doi.org/10.7916/D82V2D74Tue, 13 May 2014 14:25:34 +0000We present a Bayesian hierarchical model for indoor location estimation in wireless networks. We demonstrate that our model achieves accuracy that is similar to other published models and algorithms. By harnessing prior knowledge, our model drastically reduces the requirement for training data as compared with existing approaches.Mathematics, Statistics, Applied mathematicsdm2418StatisticsArticlesA Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization
http://academiccommons.columbia.edu/catalog/ac:173817
Eyheramendy, Susana; Madigan, David B.http://dx.doi.org/10.7916/D86M34ZFTue, 13 May 2014 14:04:25 +0000We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method in text classification and in simulated data and show that our method outperforms the logistic and probit models and also the elastic net, in general by a substantial margin.Mathematics, Statistics, Theoretical mathematicsdm2418StatisticsBook chaptersFit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to-one matched case-control data
http://academiccommons.columbia.edu/catalog/ac:174087
Zhou, Hui; Wang, Shuang; Zheng, Tianhttp://dx.doi.org/10.7916/D8028PNJMon, 12 May 2014 16:26:54 +0000Fit GFuseTLP penalized conditional logistic regression model for high-dimensional one-to-one matched case-control dataStatisticshz2240, sw2206, tz33Biostatistics, StatisticsComputer softwareA Point Process Model for the Dynamics of Limit Order Books
http://academiccommons.columbia.edu/catalog/ac:171221
Vinkovskaya, Ekaterinahttp://dx.doi.org/10.7916/D88913WWFri, 28 Feb 2014 16:44:16 +0000This thesis focuses on the statistical modeling of the dynamics of limit order books in electronic equity markets. The statistical properties of events affecting a limit order book (market orders, limit orders and cancellations) reveal strong evidence of clustering in time, cross-correlation across event types and dependence of the order flow on the bid-ask spread. Further investigation reveals the presence of a self-exciting property: a large number of events in a given time period tends to imply a higher probability of observing a large number of events in the following time period. We show that these properties may be adequately represented by a multivariate self-exciting point process with multiple regimes that reflect changes in the bid-ask spread.
We propose a tractable parametrization of the model and perform a Maximum Likelihood Estimation of the model using high-frequency data from the Trades and Quotes database for US stocks. We show that the model may be used to obtain predictions of order flow and that its predictive performance beats the Poisson model as well as Moving Average and Auto Regressive time series models.StatisticsStatisticsDissertationsMixed Methods for Mixed Models
http://academiccommons.columbia.edu/catalog/ac:169644
Dorie, Vincent J.http://dx.doi.org/10.7916/D8V40S5XWed, 22 Jan 2014 14:28:18 +0000This work bridges the frequentist and Bayesian approaches to mixed models by borrowing the best features from both camps: point estimation procedures are combined with priors to obtain accurate, fast inference while posterior simulation techniques are developed that approximate the likelihood with great precision for the purposes of assessing uncertainty. These allow flexible inferences without the need to rely on expensive Markov chain Monte Carlo simulation techniques. Default priors are developed and evaluated in a variety of simulation and real-world settings with the end result that we propose a new set of standard approaches that yield superior performance at little computational cost.StatisticsStatisticsDissertationsKernel-based association measures
http://academiccommons.columbia.edu/catalog/ac:167034
Liu, Yinghttp://hdl.handle.net/10022/AC:P:22154Thu, 07 Nov 2013 15:12:35 +0000Measures of associations have been widely used for describing the statistical relationships between two sets of variables. Traditional association measures tend to focus on specialized settings (specific types of variables or association patterns). Based on an in-depth summary of existing measures, we propose a general framework for association measures unifying existing methods and novel extensions based on kernels, including practical solutions to computational challenges. The proposed framework provides improved feature selection and extensions to a variety of current classifiers. Specifically, we introduce association screening and variable selection via maximizing kernel-based association measures. We also develop a backward dropping procedure for feature selection when there are a large number of candidate variables. We evaluate our framework using a wide variety of both simulated and real data. In particular, we conduct independence tests and feature selection using kernel association measures on diversified association patterns of different dimensions and variable types. The results show the superiority of our methods to existing ones. We also apply our framework to four real-world problems, three from statistical genetics and one of gender prediction from handwriting. We demonstrate through these applications both the de novo construction of new kernels and the adaptation of existing kernels tailored to the data at hand, and how kernel-based measures of associations can be naturally applied to different data structures including functional input and output spaces. This shows that our framework can be applied to a wide range of real world problems and work well in practice.Statistics, Computer scienceyl2802StatisticsDissertationsLow-rank graphical models and Bayesian inference in the statistical analysis of noisy neural data
http://academiccommons.columbia.edu/catalog/ac:166472
Smith, Carl Alexanderhttp://hdl.handle.net/10022/AC:P:21991Fri, 11 Oct 2013 16:56:29 +0000We develop new methods of Bayesian inference, largely in the context of analysis of neuroscience data. The work is broken into several parts. In the first part, we introduce a novel class of joint probability distributions in which exact inference is tractable. Previously it has been difficult to find general constructions for models in which efficient exact inference is possible, outside of certain classical cases. We identify a class of such models that are tractable owing to a certain "low-rank" structure in the potentials that couple neighboring variables. In the second part we develop methods to quantify and measure information loss in analysis of neuronal spike train data due to two types of noise, making use of the ideas developed in the first part. Information about neuronal identity or temporal resolution may be lost during spike detection and sorting, or precision of spike times may be corrupted by various effects. We quantify the information lost due to these effects for the relatively simple but sufficiently broad class of Markovian model neurons. We find that decoders that model the probability distribution of spike-neuron assignments significantly outperform decoders that use only the most likely spike assignments. We also apply the ideas of the low-rank models from the first section to defining a class of prior distributions over the space of stimuli (or other covariate) which, by conjugacy, preserve the tractability of inference. In the third part, we treat Bayesian methods for the estimation of sparse signals, with application to the locating of synapses in a dendritic tree. We develop a compartmentalized model of the dendritic tree. 
Building on previous work that applied and generalized ideas of least angle regression to obtain a fast Bayesian solution to the resulting estimation problem, we describe two other approaches to the same problem, one employing a horseshoe prior and the other using various spike-and-slab priors. In the last part, we revisit the low-rank models of the first section and apply them to the problem of inferring orientation selectivity maps from noisy observations of orientation preference. The relevant low-rank model exploits the self-conjugacy of the von Mises distribution on the circle. Because the orientation map model is loopy, we cannot do exact inference on the low-rank model by the forward backward algorithm, but block-wise Gibbs sampling by the forward backward algorithm speeds mixing. We explore another von Mises coupling potential Gibbs sampler that proves to effectively smooth noisily observed orientation maps.Statistics, Neurosciencescas2207Chemistry, StatisticsDissertationsThe Challenge of Communicating Computational Research
http://academiccommons.columbia.edu/catalog/ac:165636
Hong, Neil Chue; Jockers, Matthew L.; Ellis, Daniel P. W.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:21703Fri, 20 Sep 2013 11:25:29 +0000Computational approaches to scholarship have revolutionized how research is done but have at the same time complicated the process of disseminating the results of that research. Conclusions may be produced using mathematical models or custom software that are not easily accessible to, or reproducible by, those outside the research team. And in some fields, a lack of understanding of computational approaches may lead to skepticism about their use. The panel considers urgent questions faced by researchers across the range of academic disciplines. How can scientists and social scientists address the lack of access to the software and code used to produce many research results, which has led to a crisis of verifiability and concern about the accuracy of the scientific record? How can digital humanists approach discussions of computational methods, which may not fit into traditional forms of scholarship and can be viewed with suspicion in disciplines that prize the art of scholarly analysis? Computational researchers are examining communication practices, policies, and tools that promise to more effectively convey their research process and the results it produces. The panelists are: Neil Chue Hong, Director of the Software Sustainability Institute; Matthew L. Jockers, Assistant Professor of English at the University of Nebraska-Lincoln; and Daniel P. W. Ellis, Associate Professor of Electrical Engineering at Columbia University.Technical communication, Information sciencede171, vcs2115Electrical Engineering, Statistics, Center for Digital Research and Scholarship, Scholarly Communication Program, Libraries and Information ServicesInterviews and roundtablesMeasuring Scholarly Impact: The Influence of 'Altmetrics'
http://academiccommons.columbia.edu/catalog/ac:165365
Priem, Jason; Holmes, Kristi; Trasande, Caitlin Aptowicz; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:21698Fri, 20 Sep 2013 10:24:48 +0000"Altmetrics" refers to methods of measuring scholarly impact using Web-based social media. Why does it matter? In many academic fields, attaining scholarly prestige means publishing research articles in important scholarly journals. However, many in the academic community consider a journal's prestige, which is determined by a metric calculated using the number of citations to the journal, to be a poor proxy for the quality of the individual author's work. At the same time, hiring and promotion committees are looking for ways to determine the impact of alternate formats now commonly used by researchers such as blogs, data sets, videos, and social media. The panelists all work with innovative new tools for assessing scholarly impact. They are: Jason Priem, Co-Founder, ImpactStory; Kristi Holmes, Bioinformaticist, Bernard Becker Medical Library, Washington University in St. Louis School of Medicine; and Caitlin Aptowicz Trasande, Head of Science Metrics, Digital Science.Information science, Information technologyag389Statistics, Center for Digital Research and Scholarship, Scholarly Communication Program, Libraries and Information ServicesInterviews and roundtablesGeneralized Volatility-Stabilized Processes
http://academiccommons.columbia.edu/catalog/ac:165162
Pickova, Radkahttp://hdl.handle.net/10022/AC:P:21616Fri, 13 Sep 2013 15:07:49 +0000In this thesis, we consider systems of interacting diffusion processes which we call Generalized Volatility-Stabilized processes, as they extend the Volatility-Stabilized Market models introduced in Fernholz and Karatzas (2005). First, we show how to construct a weak solution of the underlying system of stochastic differential equations. In particular, we express the solution in terms of time-changed squared-Bessel processes and argue that this solution is unique in distribution. In addition, we also discuss sufficient conditions under which this solution does not explode in finite time, and provide sufficient conditions for pathwise uniqueness and for existence of a strong solution.
Secondly, we discuss the significance of these processes in the context of Stochastic Portfolio Theory. We describe specific market models which assume that the dynamics of the stocks' capitalizations is the same as that of the Generalized Volatility-Stabilized processes, and we argue that strong relative arbitrage opportunities may exist in these markets; specifically, we provide multiple examples of portfolios that outperform the market portfolio. Moreover, we examine the properties of market weights as well as the diversity-weighted portfolio in these models.
Thirdly, we provide some asymptotic results for these processes, which allow us to describe different properties of the corresponding market models based on these processes.Statisticsrp2424Statistics, MathematicsDissertationsRe-use and Reproducibility: Opportunities and Challenges
http://academiccommons.columbia.edu/catalog/ac:162944
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:20964Tue, 09 Jul 2013 09:37:23 +0000To support the reliability and accuracy of the scientific record, science policy, research infrastructure, and the culture of science must facilitate the sharing of data and code resulting from scientific research, much of which is now produced using computational methods. Though the need to support the reproducibility of computational research is now widely recognized, copyright and other factors present challenges to the development of policies and practices.Technical communication, Information technologyvcs2115StatisticsPresentationsVariability of Universal Life Cash Flows under Higher Risk Investment Strategies
http://academiccommons.columbia.edu/catalog/ac:162700
Tayal, Abhishek; Yang, Canning; Dunn, Thomas P.http://hdl.handle.net/10022/AC:P:20851Thu, 27 Jun 2013 16:09:42 +0000This integrated project studied the offsetting elements of higher nominal yields, greater credit loss expectations, and higher capital requirements on the profitability of the life insurer that pursues a higher yield investment strategy. Profitability measures were developed for a Universal Life product. The report provides an attribution of profit drivers for the insurer. The effects of credit rating migration on credit loss rates and bond capital charges were examined, and investment strategies were tested under credit stress scenarios.Financeat2842, cy2315, tpd2111Actuarial Sciences, StatisticsReportsCredit Risk Modeling and Analysis Using Copula Method and Changepoint Approach to Survival Data
http://academiccommons.columbia.edu/catalog/ac:161682
Qian, Bohttp://hdl.handle.net/10022/AC:P:20510Thu, 30 May 2013 16:36:22 +0000This thesis consists of two parts. The first part uses the Gaussian copula and Student's t copula as the main tools to model the credit risk in securitizations and re-securitizations. The second part proposes a statistical procedure to identify changepoints in the Cox model of survival data. The 2007-2009 financial crisis has been regarded by leading economists as the worst financial crisis since the Great Depression. The securitization sector took a lot of blame for the crisis because of the connection of the securitized products created from mortgages to the collapse of the housing market. The first part of this thesis explores the relationship between securitized mortgage products and the 2007-2009 financial crisis using the copula method as the main tool. We show in this part how loss distributions of securitizations and re-securitizations can be derived or calculated in a new model. Simulations are conducted to examine the effectiveness of the model. As an application, the model is also used to examine whether and where the ratings of securitized products could be flawed. Separately, the lag effect and saturation effect problems are common and important problems in survival data analysis. They belong to a general class of problems where the treatment effect takes occasional jumps instead of staying constant throughout time. Therefore, they are essentially changepoint problems in statistics. The second part of this thesis focuses on extending Lai and Xing's recent work in changepoint modeling, which was developed under a time series and Bayesian setup, to the lag effect problems in survival data. A general changepoint approach for the Cox model is developed. 
Simulations and real data analyses are conducted to illustrate the effectiveness of the procedure and how it should be implemented and interpreted.Statisticsbq2102StatisticsDissertationsWhy Public Access to Data is So Important (and why getting the policy right is even more so)
http://academiccommons.columbia.edu/catalog/ac:161424
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:20387Tue, 21 May 2013 12:08:29 +0000Open data is crucial to science today. Computation is becoming central to scientific research. “Open Data” is not well-defined. Scope: Share data and code that permit others in the field to replicate published results (traditionally done by the publication alone).Information technology, Technical communicationvcs2115StatisticsPresentationsOn optimal arbitrage under constraints
http://academiccommons.columbia.edu/catalog/ac:160495
Sadhukhan, Subhankarhttp://hdl.handle.net/10022/AC:P:20076Wed, 01 May 2013 11:07:50 +0000In this thesis, we investigate the existence of relative arbitrage opportunities in a Markovian model of a financial market, which consists of a bond and stocks, whose prices evolve like Itô processes. We consider markets where investors are constrained to choose from among a restricted set of investment strategies. We show that the upper hedging price of (i.e. the minimum amount of wealth needed to superreplicate) a given contingent claim in a constrained market can be expressed as the supremum of the fair price of the given contingent claim under certain unconstrained auxiliary Markovian markets. Under suitable assumptions, we further characterize the upper hedging price as a viscosity solution to certain variational inequalities. We then use this viscosity solution characterization to study how the imposition of stricter constraints on the market affects the upper hedging price. In particular, if relative arbitrage opportunities exist with respect to a given strategy, we study how stricter constraints can make such arbitrage opportunities disappear.Applied mathematics, Financess3240Statistics, MathematicsDissertationsStatistical Inference for Diagnostic Classification Models
http://academiccommons.columbia.edu/catalog/ac:160464
Xu, Gongjunhttp://hdl.handle.net/10022/AC:P:20058Tue, 30 Apr 2013 16:06:11 +0000Diagnostic classification models (DCM) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he/she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers to the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be further extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, this problem can be formulated as a pure significance testing problem with null hypothesis H0 : Q = Q0, where Q0 is the candidate Q-matrix. 
We propose a test statistic that measures the consistency of observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied. In addition, we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.Statistics, Educational tests and measurementsgx2108StatisticsDissertationsTestimony submitted to the House Committee on Science, Space and Technology for the March 5, 2013 hearing on Scientific Integrity and Transparency.
http://academiccommons.columbia.edu/catalog/ac:157889
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:19394Thu, 21 Mar 2013 16:23:38 +0000Reproducibility is a new challenge, brought about by advances in scientific research capability due to immense changes in technology over the last two decades. It is widely recognized as a defining hallmark of science and directly impacts the transparency and reliability of findings, and is taken very seriously by the scientific community.Technical communication, Information technologyvcs2115StatisticsPresentationsOpen Data, Open Methods, and the Promise of Large Scale Validation.
http://academiccommons.columbia.edu/catalog/ac:157883
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:19393Thu, 21 Mar 2013 16:15:09 +0000Reproducibility is core to science, and a critical issue in computational science.Technical communication, Information technologyvcs2115StatisticsPresentationsDigital Scholarship in Scientific Research: Open Questions in Reproducibility and Curation.
http://academiccommons.columbia.edu/catalog/ac:157879
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:19392Thu, 21 Mar 2013 16:08:58 +0000Computation presents only a potential third branch of the scientific method.Technical communication, Information technologyvcs2115StatisticsPresentationsTechnology and the Scientific Method: The Credibility Crisis in Computational Science
http://academiccommons.columbia.edu/catalog/ac:157876
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:19391Thu, 21 Mar 2013 16:00:08 +0000Computation presents only a potential third branch of the scientific method.Technical communication, Information technologyvcs2115StatisticsPresentationsFacilitating Reproducibility: Open Data and Code in Economics
http://academiccommons.columbia.edu/catalog/ac:157873
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:19390Thu, 21 Mar 2013 15:48:04 +0000The aim of the workshop is to build an understanding of the value of open data and open tools for the Economics profession and the obstacles to opening up information, as well as the role of greater openness in broadening understanding of and engagement with Economics among the wider community including policy-makers and society.Technical communication, Economicsvcs2115StatisticsPresentationsBayesian Model Selection in terms of Kullback-Leibler discrepancy
http://academiccommons.columbia.edu/catalog/ac:158374
Zhou, Shouhaohttp://hdl.handle.net/10022/AC:P:19157Mon, 25 Feb 2013 13:36:40 +0000In this article we investigate and develop practical model assessment and selection methods for Bayesian models, on the premise that a promising approach should be objective enough to accept, easy enough to understand, general enough to apply, simple enough to compute, and coherent enough to interpret. We mainly restrict attention to the Kullback-Leibler divergence, a widely applied model evaluation measure that quantifies the similarity between the proposed candidate model and the underlying true model, where the true model refers only to the probability distribution that is the best projection onto the statistical modeling space when we try to understand the real but unknown dynamics or mechanism of interest. In addition to reviewing and discussing the advantages and disadvantages of the historically and currently prevailing practical model selection methods in the literature, we propose a series of convenient and useful tools, each designed and applied for a different purpose, to assess in an asymptotically unbiased way how strongly the candidate Bayesian models are favored in terms of predicting a future independent observation. We also explore the connection between the Kullback-Leibler based information criterion and Bayes factors, another of the most popular Bayesian model comparison approaches, motivated by the development of the Bayes factor variants. In general, we expect to provide useful guidance for researchers who are interested in conducting Bayesian data analysis.Statisticssz2020StatisticsDissertationsMultiplicative Multiresolution Analysis for Lie-group Valued Data Indexed by a Euclidean Parameter
http://academiccommons.columbia.edu/catalog/ac:155756
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15397Wed, 12 Dec 2012 15:17:09 +0000Lie-group-valued, Euclidean-indexed data. These data might be: phase angles as functions of time or space, for example compass directions; 3D orientations of a rigid frame of reference as a function of time or space; or quaternions as a function of time or space. This can also be extended to quotients of Lie groups, which gives us the ability to model points on S2, the unit sphere, as functions of time or space.Computer science, Statisticsvcs2115StatisticsPresentationsA Brief History of the Reproducibility Movement
http://academiccommons.columbia.edu/catalog/ac:155759
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15396Wed, 12 Dec 2012 14:51:16 +0000Computational science cannot be elevated to a third branch of the scientific method until it generates routinely verifiable knowledge.Technical communication, Computer sciencevcs2115StatisticsPresentationsTransparency in Computational Science
http://academiccommons.columbia.edu/catalog/ac:154852
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15360Tue, 27 Nov 2012 14:16:44 +0000The central motivation for the scientific method is to root out error: Computational science as practiced today does not generate reliable knowledge. This presentation looks at four possible solutions to the issues of transparency in computational science.Technical communication, Computer sciencevcs2115StatisticsPresentationsDiscussant: “Pornography and Divorce”
http://academiccommons.columbia.edu/catalog/ac:154713
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15350Wed, 21 Nov 2012 13:19:39 +0000A presentation on data and design suggestions for research on the topic of pornography and divorce.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsRunMyCode.org: a Novel Dissemination and Collaboration Platform for Executing Published Computational Results
http://academiccommons.columbia.edu/catalog/ac:154716
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15349Wed, 21 Nov 2012 13:15:47 +0000A presentation on a collaboration platform for executing published computational results.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsJournal Policy and Reproducible Computational Research
http://academiccommons.columbia.edu/catalog/ac:154719
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15348Wed, 21 Nov 2012 13:05:57 +0000Discusses policy possibilities for the issues of reproducibility and dissemination in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsTowards Reproducible Science: Policy and a Path Forward
http://academiccommons.columbia.edu/catalog/ac:154722
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15347Wed, 21 Nov 2012 12:59:32 +0000Discusses solutions and policy possibilities for the issues of reproducibility and dissemination in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsMultiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box
http://academiccommons.columbia.edu/catalog/ac:154731
Su, Yu-Sung; Yajima, Masanao; Gelman, Andrew E.; Hill, Jenniferhttp://hdl.handle.net/10022/AC:P:15342Tue, 20 Nov 2012 16:49:06 +0000Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.Statisticsag389Statistics, Political ScienceArticlesR2WinBUGS: A Package for Running WinBUGS from R
http://academiccommons.columbia.edu/catalog/ac:154734
Sturtz, Sibylle; Ligges, Uwe; Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:15341Tue, 20 Nov 2012 16:42:45 +0000The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has finished, it is possible either to read the resulting data into R by the package itself—which gives a compact graphical summary of inference and convergence diagnostics—or to use the facilities of the coda package for further analyses of the output. Examples are given to demonstrate the usage of this package.Statisticsag389Statistics, Political ScienceArticlesBayesian Statistical Pragmatism
http://academiccommons.columbia.edu/catalog/ac:154737
Gelman, Andrew E.http://hdl.handle.net/10022/AC:P:15340Tue, 20 Nov 2012 16:38:18 +0000I agree with Rob Kass’ point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.Statisticsag389Statistics, Political ScienceArticlesSegregation in Social Networks Based on Acquaintanceship and Trust
http://academiccommons.columbia.edu/catalog/ac:154740
DiPrete, Thomas A.; Gelman, Andrew E.; McCormick, Tyler; Teitler, Julien O.; Zheng, Tianhttp://hdl.handle.net/10022/AC:P:15339Tue, 20 Nov 2012 16:17:57 +0000Using 2006 General Social Survey data, the authors compare levels of segregation by race and along other dimensions of potential social cleavage in the contemporary United States. Americans are not as isolated as the most extreme recent estimates suggest. However, hopes that “bridging” social capital is more common in broader acquaintanceship networks than in core networks are not supported. Instead, the entire acquaintanceship network is perceived by Americans to be about as segregated as the much smaller network of close ties. People do not always know the religiosity, political ideology, family behaviors, or socioeconomic status of their acquaintances, but perceived social divisions on these dimensions are high, sometimes rivaling racial segregation in acquaintanceship networks. The major challenge to social integration today comes from the tendency of many Americans to isolate themselves from others who differ on race, political ideology, level of religiosity, and other salient aspects of social identity.Statisticstad61, ag389, thm2105, jot8, tz33Sociology, Statistics, Political Science, Social WorkArticlesSoftware Patents as a Barrier to Scientific Transparency: An Unexpected Consequence of Bayh-Dole
http://academiccommons.columbia.edu/catalog/ac:155777
Reich, Isabel Rose; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15328Tue, 20 Nov 2012 14:27:42 +0000Discusses solutions to the reproducibility and dissemination issues in computational science. Examines the interaction between the digitization of science and Intellectual Property Law, specifically the incentives created by the Bayh‐Dole Act to patent inventions associated with university‐based research.Technical communication, Intellectual propertyirr2105, vcs2115Applied Physics and Applied Mathematics, StatisticsPresentationsData-Intensive Science: Methods for Reproducibility and Dissemination
http://academiccommons.columbia.edu/catalog/ac:154952
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15327Tue, 20 Nov 2012 14:15:54 +0000Discusses solutions to the reproducibility and dissemination issues in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsThe Reproducible Research Movement: Crisis and Solutions
http://academiccommons.columbia.edu/catalog/ac:155774
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15326Tue, 20 Nov 2012 13:56:49 +0000Discusses solutions to the reproducibility of computational research in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsDisseminating Numerically Reproducible Research
http://academiccommons.columbia.edu/catalog/ac:154846
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15325Tue, 20 Nov 2012 13:27:52 +0000Discusses solutions for reproducible computational research in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsMethods for studying the neural code in high dimensions
http://academiccommons.columbia.edu/catalog/ac:152510
Ramirez, Alexandro D.http://hdl.handle.net/10022/AC:P:14688Wed, 12 Sep 2012 16:25:41 +0000Over the last two decades technological developments in multi-electrode arrays and fluorescence microscopy have made it possible to simultaneously record from hundreds to thousands of neurons. Developing methods for analyzing these data in order to learn how networks of neurons respond to external stimuli and process information is an outstanding challenge for neuroscience. In this dissertation, I address the challenge of developing and testing models that are both flexible and computationally tractable when used with high dimensional data. In chapter 2 I will discuss an approximation to the generalized linear model (GLM) log-likelihood that I developed in collaboration with my thesis advisor. This approximation is designed to ease the computational burden of evaluating GLMs. I will show that our method reduces the computational cost of evaluating the GLM log-likelihood by a factor proportional to the number of parameters in the model times the number of observations. Therefore it is most beneficial in typical neuroscience applications where the number of parameters is large. I then detail a variety of applications where our method can be of use, including Maximum Likelihood estimation of GLM parameters, marginal likelihood calculations for model selection and Markov chain Monte Carlo methods for sampling from posterior parameter distributions. I go on to show that our model does not necessarily sacrifice accuracy for speed. Using both analytic calculations and multi-unit, primate retinal responses, I show that parameter estimates and predictions using our model can have the same accuracy as that of generalized linear models. In chapter 3 I study the neural decoding problem of predicting stimuli from neuronal responses. 
The focus is on reconstructing zebra finch song spectrograms, which are high-dimensional, by combining the spike trains of zebra finch auditory midbrain neurons with information about the correlations present in all zebra finch song. I use a GLM to model neuronal responses and a series of prior distributions, each carrying different amounts of statistical information about zebra finch song. For song reconstruction I make use of recent connections made between the applied mathematics literature on solving linear systems of equations involving matrices with special structure and neural decoding. This allowed me to calculate maximum a posteriori (MAP) estimates of song spectrograms in a time that only grows linearly, and is therefore quite tractable, with the number of time-bins in the song spectrogram. This speed was beneficial for answering questions which required the reconstruction of a variety of song spectrograms, each corresponding to different priors made on the distribution of zebra finch song. My collaborators and I found that spike trains from a population of MLd neurons combined with an uncorrelated Gaussian prior can estimate the amplitude envelope of song spectrograms. The same set of responses can be combined with Gaussian priors that have correlations matched to those found across multiple zebra finch songs to yield song spectrograms similar to those presented to the animal. The fidelity of spectrogram reconstructions from MLd responses relies more heavily on prior knowledge of spectral correlations than temporal correlations. However, the best reconstructions combine MLd responses with both spectral and temporal correlations.Neurosciencesadr2110Neurobiology and Behavior, Neuroscience, StatisticsDissertationsModeling Strategies for Large Dimensional Vector Autoregressions
http://academiccommons.columbia.edu/catalog/ac:152472
Zang, Pengfeihttp://hdl.handle.net/10022/AC:P:14666Tue, 11 Sep 2012 15:31:00 +0000The vector autoregressive (VAR) model has been widely used for describing the dynamic behavior of multivariate time series. However, fitting standard VAR models to large dimensional time series is challenging primarily due to the large number of parameters involved. In this thesis, we propose two strategies for fitting large dimensional VAR models. The first strategy involves reducing the number of non-zero entries in the autoregressive (AR) coefficient matrices and the second is a method to reduce the effective dimension of the white noise covariance matrix. We propose a 2-stage approach for fitting large dimensional VAR models where many of the AR coefficients are zero. The first stage provides initial selection of non-zero AR coefficients by taking advantage of the properties of partial spectral coherence (PSC) in conjunction with BIC. The second stage, based on $t$-ratios and BIC, further refines the spurious non-zero AR coefficients post first stage. Our simulation study suggests that the 2-stage approach outperforms Lasso-type methods in discovering sparsity patterns in AR coefficient matrices of VAR models. The performance of our 2-stage approach is also illustrated with three real data examples. Our second strategy for reducing the complexity of a large dimensional VAR model is based on a reduced-rank estimator for the white noise covariance matrix. We first derive the reduced-rank covariance estimator under the setting of independent observations and give the analytical form of its maximum likelihood estimate. Then we describe how to integrate the proposed reduced-rank estimator into the fitting of large dimensional VAR models, where we consider two scenarios that require different model fitting procedures. 
In the VAR modeling context, our reduced-rank covariance estimator not only provides interpretable descriptions of the dependence structure of VAR processes but also leads to improvement in model-fitting and forecasting over unrestricted covariance estimators. Two real data examples are presented to illustrate these fitting procedures.Statisticspz2146StatisticsDissertationsSome Models for Time Series of Counts
http://academiccommons.columbia.edu/catalog/ac:152149
Liu, Henghttp://hdl.handle.net/10022/AC:P:14561Wed, 29 Aug 2012 14:08:58 +0000This thesis focuses on developing nonlinear time series models and establishing relevant theory with a view towards applications in which the responses are integer valued. The discreteness of the observations, which is not appropriate with classical time series models, requires novel modeling strategies. The majority of the existing models for time series of counts assume that the observations follow a Poisson distribution conditional on an accompanying intensity process that drives the serial dynamics of the model. According to whether the evolution of the intensity process depends on the observations or solely on an external process, the models are classified into parameter-driven and observation-driven. Compared to the former one, an observation-driven model often allows for easier and more straightforward estimation of the model parameters. On the other hand, the stability properties of the process, such as the existence and uniqueness of a stationary and ergodic solution that are required for deriving asymptotic theory of the parameter estimates, can be quite complicated to establish, as compared to parameter-driven models. In this thesis, we first propose a broad class of observation-driven models that is based upon a one-parameter exponential family of distributions and incorporates nonlinear dynamics. The establishment of stability properties of these processes, which is at the heart of this thesis, is addressed by employing theory from iterated random functions and coupling techniques. Using this theory, we are also able to obtain the asymptotic behavior of maximum likelihood estimates of the parameters. Extensions of the base model in several directions are considered. Inspired by the idea of a self-excited threshold ARMA process, a threshold Poisson autoregression is proposed. 
It introduces a two-regime structure in the intensity process and essentially allows for modeling negatively correlated observations. E-chain, a non-standard Markov chain technique and Lyapunov's method are utilized to show the stationarity and a law of large numbers for this process. In addition, the model has been adapted to incorporate covariates, an important problem of practical and primary interest. The base model is also extended to consider the case of multivariate time series of counts. Given a suitable definition of a multivariate Poisson distribution, a multivariate Poisson autoregression process is described and its properties studied. Several simulation studies are presented to illustrate the inference theory. The proposed models are also applied to several real data sets, including the number of transactions of the Ericsson stock, the return times of Goldman Sachs Group stock prices, the number of road crashes in Schiphol, the frequencies of occurrences of gold particles, the incidences of polio in the US and the number of presentations of asthma in an Australian hospital. An array of graphical and quantitative diagnostic tools, which is specifically designed for the evaluation of goodness of fit for time series of counts models, is described and illustrated with these data sets.Statisticshl2494StatisticsDissertationsStatistical inference in two non-standard regression problems
http://academiccommons.columbia.edu/catalog/ac:151460
Seijo, Emilio Franciscohttp://hdl.handle.net/10022/AC:P:14317Wed, 08 Aug 2012 13:43:26 +0000This thesis analyzes two regression models whose least squares estimators have nonstandard asymptotics. It is divided into an introduction and two parts. The introduction motivates the study of nonstandard problems and presents an outline of the contents of the remaining chapters. In part I, the least squares estimator of a multivariate convex regression function is studied in great detail. The main contribution here is a proof of the consistency of the aforementioned estimator in a completely nonparametric setting. Model misspecification, local rates of convergence and multidimensional regression models mixing convexity and componentwise monotonicity constraints will also be considered. Part II deals with change-point regression models and the issues that might arise when applying the bootstrap to these problems. The classical bootstrap is shown to be inconsistent on a simple change-point regression model, and an alternative (smoothed) bootstrap procedure is proposed and proved to be consistent. The superiority of the alternative method is also illustrated through a simulation study. In addition, a version of the continuous mapping theorem specially suited for change-point estimators is proved and used to derive the results concerning the bootstrap.Statistics, Applied mathematics, Mathematicsefs2113StatisticsDissertationsOpen Challenges to Open Science
http://academiccommons.columbia.edu/catalog/ac:147784
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13497Mon, 11 Jun 2012 12:04:48 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsTransparency in Scientific Discovery: Innovation and Knowledge Dissemination
http://academiccommons.columbia.edu/catalog/ac:147781
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13496Mon, 11 Jun 2012 12:00:56 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsFraming Science Policy: Reproducible Research, Not Open Data
http://academiccommons.columbia.edu/catalog/ac:147778
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13495Mon, 11 Jun 2012 11:56:52 +0000Discusses open data and open code as solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsReproducible Research in Computational Science: Strategies for Innovation
http://academiccommons.columbia.edu/catalog/ac:147775
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13494Mon, 11 Jun 2012 11:53:41 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsComments on "Measuring Racial Profiling"
http://academiccommons.columbia.edu/catalog/ac:147771
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13493Mon, 11 Jun 2012 11:39:12 +0000Discussion of a quantitative analysis of race and criminal justice.Criminologyvcs2115StatisticsPresentationsThe Credibility Crisis in Computational Science: A Call to Action
http://academiccommons.columbia.edu/catalog/ac:147766
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13492Mon, 11 Jun 2012 11:36:49 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsReproducible Research: A Digital Curation Agenda
http://academiccommons.columbia.edu/catalog/ac:147763
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13491Mon, 11 Jun 2012 11:28:14 +0000Discusses the necessity of open data and open code as a solution to the credibility crisis in computational science.Information science, Intellectual propertyvcs2115StatisticsPresentationsReproducibility in Computational Science
http://academiccommons.columbia.edu/catalog/ac:147760
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13490Mon, 11 Jun 2012 11:12:58 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsThe Credibility Crisis in Computational Science: An Information Issue
http://academiccommons.columbia.edu/catalog/ac:147757
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13489Mon, 11 Jun 2012 11:08:37 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsThe Reproducible Computational Science Movement: Tools, Policy, and Results
http://academiccommons.columbia.edu/catalog/ac:147754
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13488Mon, 11 Jun 2012 11:02:50 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsBuilding the Reproducible Computational Science Movement: Catalysing Action through Policy, Software Tools, and Ideas
http://academiccommons.columbia.edu/catalog/ac:147751
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13487Mon, 11 Jun 2012 10:58:44 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsIntellectual Property and Innovation in Computational Science: Dissemination of Ideas and Methodology
http://academiccommons.columbia.edu/catalog/ac:147748
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13486Mon, 11 Jun 2012 10:38:52 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsCopyright and MetaData in the World Heritage Digital Mathematical Library
http://academiccommons.columbia.edu/catalog/ac:147745
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13485Mon, 11 Jun 2012 10:16:28 +0000Intellectual property, Library sciencevcs2115StatisticsPresentationsScientists, Share Secrets or Lose Funding
http://academiccommons.columbia.edu/catalog/ac:147742
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13484Mon, 11 Jun 2012 09:40:44 +0000More and more published scientific studies are difficult or impossible to repeat. Federal agencies that fund scientific research are in a position to help fix this problem. They should require that all scientists whose studies they finance share the files that generated their published findings, the raw data and the computer instructions that carried out their analysis.Technical communication, Intellectual propertyvcs2115StatisticsArticlesThe Central Role of Geophysics in the Reproducible Research Movement
http://academiccommons.columbia.edu/catalog/ac:147729
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13456Fri, 08 Jun 2012 09:54:48 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsGenus Distributions of Graphs Constructed Through Amalgamations
http://academiccommons.columbia.edu/catalog/ac:146091
Poshni, Mehvish Irfanhttp://hdl.handle.net/10022/AC:P:12989Thu, 12 Apr 2012 12:46:03 +0000Graphs are commonly represented as points in space connected by lines. The points in space are the vertices of the graph, and the lines joining them are the edges of the graph. A general definition of a graph is considered here, where multiple edges are allowed between two vertices and an edge is permitted to connect a vertex to itself. It is assumed that graphs are connected, i.e., any vertex in the graph is reachable from another distinct vertex either directly through an edge connecting them or by a path consisting of intermediate vertices and connecting edges. Under this visual representation, graphs can be drawn on various surfaces. The focus of my research is restricted to a class of surfaces that are characterized as compact connected orientable 2-manifolds. The drawings of graphs on surfaces that are of primary interest follow certain prescribed rules. These are called 2-cellular graph embeddings, or simply embeddings. A well-known closed formula makes it easy to enumerate the total number of 2-cellular embeddings for a given graph over all surfaces. A much harder task is to give a surface-wise breakdown of this number as a sequence of numbers that count the number of 2-cellular embeddings of a graph for each orientable surface. This sequence of numbers for a graph is known as the genus distribution of a graph. Prior research on genus distributions of graphs has primarily focused on making calculations of genus distributions for specific families of graphs. These families of graphs have often been contrived, and the methods used for finding their genus distributions have not been general enough to extend to other graph families. The research I have undertaken aims at developing and using a general method that frames the problem of calculating genus distributions of large graphs in terms of a partitioning of the genus distributions of smaller graphs. 
To this end, I use various operations such as edge-amalgamation, self-edge-amalgamation, and vertex-amalgamation to construct large graphs out of smaller graphs, by coupling their vertices and edges together in certain consistent ways. This method assumes that the partitioned genus distribution of the smaller graphs is known or is easily calculable by computer, for instance, by using the famous Heffter-Edmonds algorithm. As an outcome of the techniques used, I obtain general recurrences and closed-formulas that give genus distributions for infinitely many recursively specifiable graph families. I also give an easily understood method for finding non-trivial examples of distinct graphs having the same genus distribution. In addition to this, I describe an algorithm that computes the genus distributions for a family of graphs known as the 4-regular outerplanar graphs.Computer sciencemp2452Computer Science, StatisticsDissertationsData Management and Federal Funding: What Researchers Need to Know
http://academiccommons.columbia.edu/catalog/ac:142524
Choudhury, Sayeed; Stodden, Victoria C.; Lehnert, Kerstin A.; Schlosser, Peterhttp://hdl.handle.net/10022/AC:P:11997Wed, 14 Dec 2011 11:22:22 +0000New requirements from the National Science Foundation and other federal agencies have brought data management and sharing into the spotlight. This trend will continue as more research sponsors, and the general public, demand increased access to federally-funded research data. This event examines the goals of these requirements and explores the technical, scientific, and professional challenges resulting from efforts to preserve and share data.Information sciencevcs2115, kal50, ps10Statistics, Lamont-Doherty Earth Observatory, Earth and Environmental Engineering, Earth Institute, Libraries and Information Services, Center for Digital Research and Scholarship, Scholarly Communication ProgramInterviews and roundtablesFinding a Maximum-Genus Graph Imbedding
http://academiccommons.columbia.edu/catalog/ac:142035
Furst, Merrick L.; Gross, Jonathan L.; McGeoch, Lyle A.http://hdl.handle.net/10022/AC:P:11837Mon, 28 Nov 2011 12:37:02 +0000The computational complexity of constructing the imbeddings of a given graph into surfaces of different genus is not well-understood. In this paper, topological methods and a reduction to linear matroid parity are used to develop a polynomial-time algorithm to find a maximum-genus cellular imbedding. This seems to be the first imbedding algorithm for which the running time is not exponential in the genus of the imbedding surface.Computer sciencejlg2Computer Science, StatisticsTechnical reportsAn Information-Theoretic Scale for Cultural Rule Systems
http://academiccommons.columbia.edu/catalog/ac:140503
Gross, Jonathan L.http://hdl.handle.net/10022/AC:P:11478Tue, 18 Oct 2011 11:57:04 +0000Important cultural messages are expressed in nonverbal media such as food, clothing, or the allocation of space or time. For instance, how and what a group of persons eats on a particular occasion may convey public information about that occasion and about the group of persons eating together. Whereas attention seems to be most commonly directed toward the individual character of the information, the present concern is the quantity of public information, as observed in the pattern of nonverbal cultural signs. To measure this quantity, it is proposed that the pattern of cultural signs be encoded as a sequence of abstract symbols (e.g. letters of the alphabet) and its complexity appraised by a suitably adapted form of the measure of Kolmogorov and Chaitin. That is, an algorithmic language is constructed and the mathematical information quantity is reckoned as the length of the shortest program that yields the sequence. In this cultural context, the measure is called "intricacy". By focusing on syntactic structure and pattern variation rather than on background levels, intricacy resists some influences of material wealth that tend to distort comparisons of individuals and groups. A compact mathematical overview of the theory is presented and an experiment to test it within the social medium of food sharing is briefly described.Information science, Sociology, Applied mathematicsjlg2Computer Science, StatisticsTechnical reportsMultiscale Representations for Manifold-Valued Data
http://academiccommons.columbia.edu/catalog/ac:140178
Rahman, Inam Ur; Drori, Iddo; Stodden, Victoria C.; Donoho, David L.; Schroeder, Peterhttp://hdl.handle.net/10022/AC:P:11434Tue, 11 Oct 2011 15:45:58 +0000We describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as: the sphere S2, the special orthogonal group SO(3), the positive definite matrices SPD(n), and the Grassmann manifolds G(n, k). The representations are based on the deployment of Deslauriers-Dubuc and Average Interpolating pyramids "in the tangent plane" of such manifolds, using the Exp and Log maps of those manifolds. The representations provide "wavelet coefficients" which can be thresholded, quantized, and scaled much as traditional wavelet coefficients. Tasks such as compression, noise removal, contrast enhancement, and stochastic simulation are facilitated by this representation. The approach applies to general manifolds, but is particularly suited to the manifolds we consider, i.e. Riemannian symmetric spaces, such as Sn−1, SO(n), G(n, k), where the Exp and Log maps are effectively computable. Applications to manifold-valued data sources of a geometric nature (motion, orientation, diffusion) seem particularly immediate. A software toolbox, SymmLab, can reproduce the results discussed in this paper.Statisticsvcs2115StatisticsArticlesWhen Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
http://academiccommons.columbia.edu/catalog/ac:140175
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11433Tue, 11 Oct 2011 15:32:23 +0000We interpret non-negative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of "parts" and NMF correctly identifies the "parts". We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases.Statisticsvcs2115StatisticsArticlesFast l1 Minimization for Genomewide Analysis of mRNA Lengths
http://academiccommons.columbia.edu/catalog/ac:140172
Drori, Iddo; Stodden, Victoria C.; Hurowitz, Evan H.Tue, 11 Oct 2011 15:19:48 +0000Application of the virtual northern method to human mRNA allows us to systematically measure transcript length on a genome-wide scale [1]. Characterization of RNA transcripts by length provides a measurement which complements cDNA sequencing. We have robustly extracted the lengths of the transcripts expressed by each gene for comparison with the Unigene, Refseq, and H-Invitational databases [2, 3]. Obtaining an accurate probability for each peak requires performing multiple bootstrap simulations, each involving a deconvolution operation which is equivalent to finding the sparsest non-negative solution of an underdetermined system of equations. This process is computationally intensive for a large number of simulations and genes. In this contribution we present an efficient approximation method which is faster than general purpose solvers by two orders of magnitude, and in practice reduces our processing time from a week to hours.Genetics, Statisticsvcs2115StatisticsArticlesBreakdown Point of Model Selection When the Number of Variables Exceeds the Number of Observations
http://academiccommons.columbia.edu/catalog/ac:140168
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11431Tue, 11 Oct 2011 15:07:17 +0000The classical multivariate linear regression problem assumes p variables X1, X2, ... , Xp and a response vector y, each with n observations, and a linear relationship between the two: y = X beta + z, where z ~ N(0, sigma^2). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p ≫ n. We find that 1) the breakdown point is well-defined for random X-models and low noise, 2) increasing noise shifts the breakdown point to lower levels of sparsity, and reduces the model recovery ability of the algorithm in a systematic way, and 3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.Statisticsvcs2115StatisticsArticlesSparseLab Architecture
http://academiccommons.columbia.edu/catalog/ac:140164
Donoho, David L.; Stodden, Victoria C.; Tsaig, Yaakovhttp://hdl.handle.net/10022/AC:P:11430Tue, 11 Oct 2011 14:54:27 +0000Changes and Enhancements for Release 2.0: 4 papers have been added to SparseLab 2.0: "Fast Solution of l1-norm Minimization Problems When the Solutions May be Sparse"; "Why Simple Shrinkage is Still Relevant For Redundant Representations"; "Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise"; "On the Stability of Basis Pursuit in the Presence of Noise." This document describes the architecture of SparseLab version 2.0. It is designed for users who already have had day-to-day interaction with the package and now need specific details about the architecture of the package, for example to modify components for their own research.Technical communication, Computer sciencevcs2115StatisticsTechnical reportsAbout SparseLab
http://academiccommons.columbia.edu/catalog/ac:140160
Donoho, David L.; Stodden, Victoria C.; Tsaig, Yaakovhttp://hdl.handle.net/10022/AC:P:11429Tue, 11 Oct 2011 14:42:12 +0000Changes and Enhancements for Release 2.0: 4 papers have been added to SparseLab 2.0: "Fast Solution of l1-norm Minimization Problems When the Solutions May be Sparse"; "Why Simple Shrinkage is Still Relevant For Redundant Representations"; "Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise"; "On the Stability of Basis Pursuit in the Presence of Noise." SparseLab is a library of Matlab routines for finding sparse solutions to underdetermined systems. The library is available free of charge over the Internet. Versions are provided for Macintosh, UNIX and Windows machines. Downloading and installation instructions are given here. SparseLab has over 400 .m files which are documented, indexed and cross-referenced in various ways. In this document we suggest several ways to get started using SparseLab: (a) trying out the pedagogical examples, (b) running the demonstrations, which illustrate the use of SparseLab in published papers, and (c) browsing the extensive collection of source files, which are self-documenting. SparseLab makes available, in one package, all the code to reproduce all the figures in the included published articles. The interested reader can inspect the source code to see exactly what algorithms were used, and how parameters were set in producing our figures, and can then modify the source to produce variations on our results. SparseLab has been developed, in part, because of exhortations by Jon Claerbout of Stanford that computational scientists should engage in "really reproducible" research. This document helps with installation and getting started, as well as describing the philosophy, limitations and rules of the road for this software.Technical communication, Computer sciencevcs2115StatisticsTechnical reportsVirtual Northern Analysis of the Human Genome
http://academiccommons.columbia.edu/catalog/ac:140156
Hurowitz, Evan H.; Drori, Iddo; Stodden, Victoria C.; Brown, Patrick O.; Donoho, David L.http://hdl.handle.net/10022/AC:P:11428Tue, 11 Oct 2011 14:27:15 +0000We applied the Virtual Northern technique to human brain mRNA to systematically measure human mRNA transcript lengths on a genome-wide scale. We used separation by gel electrophoresis followed by hybridization to cDNA microarrays to measure 8,774 mRNA transcript lengths representing at least 6,238 genes at high (>90%) confidence. By comparing these transcript lengths to the Refseq and H-Invitational full-length cDNA databases, we found that nearly half of our measurements appeared to represent novel transcript variants. Comparison of length measurements determined by hybridization to different cDNAs derived from the same gene identified clones that potentially correspond to alternative transcript variants. We observed a close linear relationship between ORF and mRNA lengths in human mRNAs, identical in form to the relationship we had previously identified in yeast. Some functional classes of protein are encoded by mRNAs whose untranslated regions (UTRs) tend to be longer or shorter than average; these functional classes were similar in both human and yeast. Human transcript diversity is extensive and largely unannotated. Our length dataset can be used as a new criterion for judging the completeness of cDNAs and annotating mRNA sequences. Similar relationships between the lengths of the UTRs in human and yeast mRNAs and the functions of the proteins they encode suggest that UTR sequences serve an important regulatory role among eukaryotes.Genetics, Molecular biologyvcs2115StatisticsArticlesThe Legal Framework for Reproducible Scientific Research: Licensing and Copyright
http://academiccommons.columbia.edu/catalog/ac:140153
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11427Tue, 11 Oct 2011 13:53:13 +0000As computational researchers increasingly make their results available in a reproducible way, and often outside the traditional journal publishing mechanism, questions naturally arise with regard to copyright, subsequent use and citation, and ownership rights in general. The growing number of scientists who release their research publicly face a gap in the current licensing and copyright structure, particularly on the Internet. Scientific research produces more than the final paper: The code, data structures, experimental design and parameters, documentation, and figures are all important for scholarship communication and result replication. The author proposes the reproducible research standard for scientific researchers to use for all components of their scholarship that should encourage reproducible scientific investigation through attribution, facilitate greater collaboration, and promote engagement of the larger community in scientific learning and discovery.Technical communication, Intellectual propertyvcs2115StatisticsArticlesReproducible Research in Computational Harmonic Analysis
http://academiccommons.columbia.edu/catalog/ac:140150
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11426Tue, 11 Oct 2011 13:47:29 +0000Scientific computation is emerging as absolutely central to the scientific method. Unfortunately, it's error-prone and currently immature—traditional scientific publication is incapable of finding and rooting out errors in scientific computation—which must be recognized as a crisis. An important recent development and a necessary response to the crisis is reproducible computational research in which researchers publish the article along with the full computational environment that produces the results. The authors have practiced reproducible computational research for 15 years and have integrated it with their scientific research and with doctoral and postdoctoral education. In this article, they review their approach and how it has evolved over time, discussing the arguments for and against working reproducibly.Technical communication, Information sciencevcs2115StatisticsArticlesEnabling Reproducible Research: Open Licensing for Scientific Innovation
http://academiccommons.columbia.edu/catalog/ac:140147
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11425Tue, 11 Oct 2011 13:17:47 +0000There is a gap in the current licensing and copyright structure for the growing number of scientists releasing their research publicly, particularly on the Internet. Scientific research produces more scholarship than the final paper: for example, the code, data structures, experimental design and parameters, documentation, and figures, are all important both for communication of the scholarship and replication of the results. US copyright law is a barrier to the sharing of scientific scholarship since it establishes exclusive rights for creators over their work, thereby limiting the ability of others to copy, use, build upon, or alter the research. This is precisely opposite to prevailing scientific norms, which provide both that results be replicated before accepted as knowledge, and that scientific understanding be built upon previous discoveries for which authorship recognition is given. In accordance with these norms and to encourage the release of all scientific scholarship, I propose the Reproducible Research Standard (RRS) both to ensure attribution and facilitate the sharing of scientific works. Using the RRS on all components of scientific scholarship will encourage reproducible scientific investigation, facilitate greater collaboration, and promote engagement of the larger community in scientific learning and discovery.Technical communication, Intellectual propertyvcs2115StatisticsArticlesA Global Empirical Evaluation of New Communication Technology Use and Democratic Tendency
http://academiccommons.columbia.edu/catalog/ac:140144
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11424Tue, 11 Oct 2011 12:25:25 +0000Is the dramatic increase in Internet use associated with a commensurate rise in democracy? Few previous studies have drawn on multiple perception-based measures of governance to assess the Internet's effects on the process of democratization. This paper uses perception-based time series data on "Voice & Accountability," "Political Stability," and "Rule of Law" to provide insights into democratic tendency. The results of regression analysis suggest that the level of "Voice & Accountability" in a country increases with Internet use, while the level of "Political Stability" decreases with increasing Internet use. Additionally, Internet use was found to increase significantly for countries with increasing levels of "Voice & Accountability." In contrast, "Rule of Law" was not significantly affected by a country's level of Internet use. Increasing cell phone use did not seem to affect "Voice & Accountability," "Political Stability," or "Rule of Law." In turn, cell phone use was not affected by any of these three measures of democratic tendency. When limiting our analysis to autocratic regimes, we noted a significant negative effect of Internet and cell phone use on "Political Stability" and found that the "Rule of Law" and "Political Stability" metrics drove ICT adoption.Web studies, Political sciencevcs2115StatisticsArticlesOpen science: policy implications for the evolving phenomenon of user-led scientific innovation
http://academiccommons.columbia.edu/catalog/ac:140127
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11419Mon, 10 Oct 2011 16:21:09 +0000From contributions of astronomy data and DNA sequences to disease treatment research, scientific activity by non-scientists is a real and emergent phenomenon, and it is raising policy questions. This involvement in science can be understood as an issue of access to publications, code, and data that facilitates public engagement in the research process; appropriate policy to support the associated welfare-enhancing benefits is therefore essential. Current legal barriers to citizen participation can be alleviated by scientists' use of the "Reproducible Research Standard," thus making the literature, data, and code associated with scientific results accessible. The enterprise of science is undergoing deep and fundamental changes, particularly in how scientists obtain results and share their work: the promise of open research dissemination held by the Internet is gradually being fulfilled by scientists. Contributions to science from beyond the ivory tower are forcing a rethinking of traditional models of knowledge generation, evaluation, and communication. The notion of a scientific "peer" is blurred with the advent of lay contributions to science, raising questions regarding the concepts of peer review and recognition. New collaborative models are emerging around both open scientific software and the generation of scientific discoveries that bear a similarity to open innovation models in other settings. Public engagement in science can be understood as an issue of access to knowledge for public involvement in the research process, facilitated by appropriate policy to support the welfare-enhancing benefits deriving from citizen science.Technical communication, Information sciencevcs2115StatisticsArticlesReproducible Research: Addressing the Need for Data and Code Sharing in Computational Science
http://academiccommons.columbia.edu/catalog/ac:140124
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11418Mon, 10 Oct 2011 16:05:57 +0000Roundtable participants identified ways of making computational research details readily available, which is a crucial step in addressing the current credibility crisis.Technical communication, Information sciencevcs2115StatisticsArticlesThe Scientific Method in Practice: Reproducibility in the Computational Sciences
http://academiccommons.columbia.edu/catalog/ac:140117
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11417Mon, 10 Oct 2011 15:53:35 +0000Since the 1660s the scientific method has included reproducibility as a mainstay in its effort to root out error from scientific discovery. With the explosive growth of digitization in scientific research and communication, it is easier than ever to satisfy this requirement. In computational research, experimental details and methods can be recorded in code and scripts, data is digital, papers are frequently online, and the result is the potential for "really reproducible research." Imagine the ability to routinely inspect code and data and recreate others' results: every step taken to achieve the findings can potentially be transparent. Now imagine anyone with an Internet connection and the capability of running the code being able to do this. This paper investigates the obstacles blocking the sharing of code and data to understand the conditions under which computational scientists reveal their full research compendium. A survey of registrants at a top machine learning conference (NIPS) was used to discover the strength of underlying factors that affect the decision to reveal code, data, and ideas. Sharing of code and data is becoming more common: about a third of respondents post some on their websites, and about 85% self-report having some code or data publicly available on the web. Contrary to theoretical expectations, the decision to share work is grounded in communitarian norms, although when work remains hidden, private incentives dominate the decision. We find that code, data, and ideas are each regarded differently in terms of how they are revealed and that guidance from scientific norms varies with the pervasiveness of computation in the field. The largest barriers to sharing are the time involved in preparing work and the legal Intellectual Property framework scientists face. This paper does two things. First, it provides evidence in the debate about whether scientists' research-revealing behavior is wholly governed by considerations of personal impact or whether the reasoning behind the revealing decision involves larger scientific ideals. Second, it describes the actual sharing behavior in the Machine Learning community.Technical communication, Computer sciencevcs2115StatisticsWorking papersRemarks presented before the National Academies Committee on the Impact of Copyright Policy on Innovation in the Digital Era
http://academiccommons.columbia.edu/catalog/ac:140113
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11416Mon, 10 Oct 2011 15:42:23 +0000Thank you for the opportunity to address this committee at the National Academy of Science. You are uniquely positioned to contend with the barriers to innovation that arise through the impact of copyright law on scientific integrity. In my remarks I hope to convince you of the urgent need for the Committee to redress these barriers directly by recommending open licensing for scientific works, in particular code and data. Copyright law works counter to scientific progress, with enormous impact on innovation both inside and outside the scientific enterprise.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsCyber Science and Engineering: A Report of the National Science Foundation Advisory Committee for Cyberinfrastructure Task Force on Grand Challenges
http://academiccommons.columbia.edu/catalog/ac:140109
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11415Mon, 10 Oct 2011 15:08:10 +0000This document contains the findings and recommendations of the NSF – Advisory Committee for Cyberinfrastructure Task Force on Grand Challenges addressed by advances in Cyber Science and Engineering. The term Cyber Science and Engineering (CS&E) is introduced to describe the intellectual discipline that brings together core areas of science and engineering, computer science, and computational and applied mathematics in a concerted effort to use the cyberinfrastructure (CI) for scientific discovery and engineering innovations; CS&E is computational and data-based science and engineering enabled by CI. The report examines a host of broad issues faced in addressing the Grand Challenges of science and technology and explores how those can be met by advances in CI. Included in the report are recommendations for new programs and initiatives that will expand the portfolio of the Office of Cyberinfrastructure and that will be critical to advances in all areas of science and engineering that rely on the CI.Technical communication, Information sciencevcs2115StatisticsReportsIntellectual Contributions to Digitized Science: Implementing the Scientific Method
http://academiccommons.columbia.edu/catalog/ac:139738
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11401Thu, 06 Oct 2011 12:08:53 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsBasics of Intellectual Property for Computational Scientists
http://academiccommons.columbia.edu/catalog/ac:139735
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11400Thu, 06 Oct 2011 12:04:21 +0000Presented at the Applied Mathematics Perspectives workshop, "Reproducible Research: Tools and Strategies for Scientific Computing," Vancouver, B.C., July 13-16, 2011.Technical communication, Information sciencevcs2115StatisticsPresentationsFunding Agency Policy and the Credibility Crisis in Computational Science
http://academiccommons.columbia.edu/catalog/ac:139732
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11399Thu, 06 Oct 2011 11:56:41 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsWhat is Reproducible Research? The Practice of Science Today and the Scientific Method
http://academiccommons.columbia.edu/catalog/ac:139728
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11398Thu, 06 Oct 2011 11:48:22 +0000Technical communication, Information sciencevcs2115StatisticsPresentations