Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Statistics&q=&rows=500&sort=record_creation_date+desc
A Point Process Model for the Dynamics of Limit Order Books
http://academiccommons.columbia.edu/catalog/ac:171221
Vinkovskaya, Ekaterina
http://dx.doi.org/10.7916/D88913WW
Fri, 28 Feb 2014 16:44:16 +0000

This thesis focuses on the statistical modeling of the dynamics of limit order books in electronic equity markets. The statistical properties of the events affecting a limit order book (market orders, limit orders, and cancellations) reveal strong evidence of clustering in time, cross-correlation across event types, and dependence of the order flow on the bid-ask spread. Further investigation reveals a self-exciting property: a large number of events in a given time period tends to imply a higher probability of observing a large number of events in the following period. We show that these properties may be adequately represented by a multivariate self-exciting point process with multiple regimes that reflect changes in the bid-ask spread.
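The self-exciting dynamics described above can be illustrated with a minimal univariate (Hawkes-type) sketch simulated by Ogata's thinning method; the exponential kernel and the parameter names `mu`, `alpha`, `beta` are illustrative assumptions, not the thesis's multivariate, regime-switching parametrization:

```python
import math
import random

def hawkes_simulate(mu, alpha, beta, horizon, seed=0):
    """Simulate a univariate Hawkes process by Ogata's thinning.

    Intensity: lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)),
    so each event temporarily raises the probability of further events.
    """
    rng = random.Random(seed)
    events = []
    t = 0.0
    while t < horizon:
        # Between events the intensity decays, so its current value is an
        # upper bound; propose the next candidate time at that rate.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)
        if t >= horizon:
            break
        lam_t = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:
            events.append(t)  # accept: the process excites itself
    return events
```

With branching ratio `alpha / beta < 1` the process is stable; clustering in time shows up as bursts of closely spaced accepted events.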
We propose a tractable parametrization of the model and perform maximum likelihood estimation using high-frequency data from the Trades and Quotes database for US stocks. We show that the model can be used to predict order flow and that its predictive performance beats the Poisson model as well as moving-average and autoregressive time series models.
Subjects: Statistics. Departments: Statistics. Type: Dissertations.

Mixed Methods for Mixed Models
http://academiccommons.columbia.edu/catalog/ac:169644
Dorie, Vincent J.
http://dx.doi.org/10.7916/D8V40S5X
Wed, 22 Jan 2014 14:28:18 +0000

This work bridges the frequentist and Bayesian approaches to mixed models by borrowing the best features of both camps: point estimation procedures are combined with priors to obtain accurate, fast inference, while posterior simulation techniques are developed that approximate the likelihood with great precision for the purpose of assessing uncertainty. These allow flexible inference without the need to rely on expensive Markov chain Monte Carlo simulation. Default priors are developed and evaluated in a variety of simulated and real-world settings, with the end result that we propose a new set of standard approaches that yield superior performance at little computational cost.
Subjects: Statistics. Departments: Statistics. Type: Dissertations.

Kernel-based association measures
http://academiccommons.columbia.edu/catalog/ac:167034
Liu, Ying
http://hdl.handle.net/10022/AC:P:22154
Thu, 07 Nov 2013 15:12:35 +0000

Measures of association have been widely used to describe the statistical relationship between two sets of variables. Traditional association measures tend to focus on specialized settings (specific types of variables or association patterns). Building on an in-depth summary of existing measures, we propose a general framework for association measures that unifies existing methods and novel extensions based on kernels, including practical solutions to computational challenges. The proposed framework provides improved feature selection and extensions to a variety of current classifiers. Specifically, we introduce association screening and variable selection via maximizing kernel-based association measures, and we develop a backward dropping procedure for feature selection when there are a large number of candidate variables. We evaluate our framework on a wide variety of both simulated and real data. In particular, we conduct independence tests and feature selection using kernel association measures on diverse association patterns of different dimensions and variable types; the results show the superiority of our methods over existing ones. We also apply our framework to four real-world problems, three from statistical genetics and one of gender prediction from handwriting. Through these applications we demonstrate both the de novo construction of new kernels and the adaptation of existing kernels to the data at hand, and show how kernel-based measures of association can be naturally applied to different data structures, including functional input and output spaces. This shows that our framework can be applied to a wide range of real-world problems and works well in practice.
Subjects: Statistics, Computer science. Author ID: yl2802. Departments: Statistics. Type: Dissertations.

Low-rank graphical models and Bayesian inference in the statistical analysis of noisy neural data
http://academiccommons.columbia.edu/catalog/ac:166472
Smith, Carl Alexander
http://hdl.handle.net/10022/AC:P:21991
Fri, 11 Oct 2013 16:56:29 +0000

We develop new methods of Bayesian inference, largely in the context of the analysis of neuroscience data. The work is broken into several parts. In the first part, we introduce a novel class of joint probability distributions in which exact inference is tractable. It has previously been difficult to find general constructions of models in which efficient exact inference is possible, outside of certain classical cases. We identify a class of such models that are tractable owing to a certain "low-rank" structure in the potentials that couple neighboring variables. In the second part, we develop methods to quantify and measure the information lost in the analysis of neuronal spike train data due to two types of noise, making use of the ideas developed in the first part. Information about neuronal identity or temporal resolution may be lost during spike detection and sorting, and the precision of spike times may be corrupted by various effects. We quantify the information lost to these effects for the relatively simple but sufficiently broad class of Markovian model neurons. We find that decoders that model the probability distribution of spike-neuron assignments significantly outperform decoders that use only the most likely spike assignments. We also apply the low-rank models of the first part to define a class of prior distributions over the space of stimuli (or other covariates) which, by conjugacy, preserve the tractability of inference. In the third part, we treat Bayesian methods for the estimation of sparse signals, with application to locating synapses in a dendritic tree. We develop a compartmentalized model of the dendritic tree.
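Exact inference in chain-structured (Markovian) models of the kind discussed here is typically carried out by forward-backward message passing. The following is a generic discrete-state sketch of that algorithm, not the thesis's low-rank construction:

```python
import numpy as np

def forward_backward(pi, A, lik):
    """Posterior marginals p(z_t | all observations) for a discrete-state chain.

    pi: (S,) initial distribution; A: (S, S) transition matrix (rows sum to 1);
    lik: (T, S) per-step observation likelihoods p(x_t | z_t = s).
    """
    T, S = lik.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = pi * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                  # forward pass (filtering)
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):         # backward pass
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta                    # combine and renormalize
    return post / post.sum(axis=1, keepdims=True)
```

The per-step normalizations only rescale the messages, so the returned smoothed marginals are unchanged while the recursion stays numerically stable for long chains.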
Building on previous work that applied and generalized ideas of least angle regression to obtain a fast Bayesian solution to the resulting estimation problem, we describe two other approaches to the same problem, one employing a horseshoe prior and the other using various spike-and-slab priors. In the last part, we revisit the low-rank models of the first part and apply them to the problem of inferring orientation selectivity maps from noisy observations of orientation preference. The relevant low-rank model exploits the self-conjugacy of the von Mises distribution on the circle. Because the orientation map model is loopy, we cannot do exact inference on the low-rank model by the forward-backward algorithm, but block-wise Gibbs sampling by the forward-backward algorithm speeds mixing. We also explore another Gibbs sampler, with a von Mises coupling potential, that proves effective at smoothing noisily observed orientation maps.
Subjects: Statistics, Neuroscience. Author ID: cas2207. Departments: Chemistry, Statistics. Type: Dissertations.

The Challenge of Communicating Computational Research
http://academiccommons.columbia.edu/catalog/ac:165636
Hong, Neil Chue; Jockers, Matthew L.; Ellis, Daniel P. W.; Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:21703
Fri, 20 Sep 2013 11:25:29 +0000

Computational approaches to scholarship have revolutionized how research is done but have at the same time complicated the process of disseminating the results of that research. Conclusions may be produced using mathematical models or custom software that are not easily accessible to, or reproducible by, those outside the research team. And in some fields, a lack of understanding of computational approaches may lead to skepticism about their use. The panel considers urgent questions faced by researchers across the range of academic disciplines. How can scientists and social scientists address the lack of access to the software and code used to produce many research results, which has led to a crisis of verifiability and concern about the accuracy of the scientific record? How can digital humanists approach discussions of computational methods, which may not fit into traditional forms of scholarship and can be viewed with suspicion in disciplines that prize the art of scholarly analysis? Computational researchers are examining communication practices, policies, and tools that promise to more effectively convey their research process and the results it produces. The panelists are: Neil Chue Hong, Director of the Software Sustainability Institute; Matthew L. Jockers, Assistant Professor of English at the University of Nebraska-Lincoln; and Daniel P. W. Ellis, Associate Professor of Electrical Engineering at Columbia University.
Subjects: Technical communication, Information science. Author IDs: de171, vcs2115. Departments: Electrical Engineering, Statistics, Center for Digital Research and Scholarship, Scholarly Communication Program, Libraries and Information Services. Type: Interviews and roundtables.

Measuring Scholarly Impact: The Influence of 'Altmetrics'
http://academiccommons.columbia.edu/catalog/ac:165365
Priem, Jason; Holmes, Kristi; Trasande, Caitlin Aptowicz; Gelman, Andrew E.
http://hdl.handle.net/10022/AC:P:21698
Fri, 20 Sep 2013 10:24:48 +0000

"Altmetrics" refers to methods of measuring scholarly impact using Web-based social media. Why does it matter? In many academic fields, attaining scholarly prestige means publishing research articles in important scholarly journals. However, many in the academic community consider a journal's prestige, which is determined by a metric calculated using the number of citations to the journal, to be a poor proxy for the quality of the individual author's work. At the same time, hiring and promotion committees are looking for ways to determine the impact of alternate formats now commonly used by researchers such as blogs, data sets, videos, and social media. The panelists all work with innovative new tools for assessing scholarly impact. They are: Jason Priem, Co-Founder, ImpactStory; Kristi Holmes, Bioinformaticist, Bernard Becker Medical Library, Washington University in St. Louis School of Medicine; and Caitlin Aptowicz Trasande, Head of Science Metrics, Digital Science.
Subjects: Information science, Information technology. Author ID: ag389. Departments: Statistics, Center for Digital Research and Scholarship, Scholarly Communication Program, Libraries and Information Services. Type: Interviews and roundtables.

Generalized Volatility-Stabilized Processes
http://academiccommons.columbia.edu/catalog/ac:165162
Pickova, Radka
http://hdl.handle.net/10022/AC:P:21616
Fri, 13 Sep 2013 15:07:49 +0000

In this thesis, we consider systems of interacting diffusion processes which we call Generalized Volatility-Stabilized processes, as they extend the Volatility-Stabilized Market models introduced in Fernholz and Karatzas (2005). First, we show how to construct a weak solution of the underlying system of stochastic differential equations. In particular, we express the solution in terms of time-changed squared-Bessel processes and argue that this solution is unique in distribution. We also discuss sufficient conditions under which this solution does not explode in finite time, and provide sufficient conditions for pathwise uniqueness and for the existence of a strong solution.
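The squared-Bessel building block mentioned above can be illustrated with a simple Euler scheme for the SDE dX_t = delta dt + 2 sqrt(X_t) dW_t; this is a generic sketch of the process itself, not the thesis's time-change construction:

```python
import numpy as np

def squared_bessel_paths(x0, delta, horizon, n_steps, n_paths, seed=0):
    """Euler-scheme paths of a squared-Bessel process BESQ(delta).

    SDE: dX_t = delta * dt + 2 * sqrt(X_t) * dW_t, started at x0 >= 0.
    The max(..., 0) clamp keeps the discretized paths inside [0, inf),
    where the continuous process lives.
    """
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    X = np.full((n_paths, n_steps + 1), float(x0))
    for k in range(n_steps):
        dW = rng.standard_normal(n_paths) * np.sqrt(dt)
        X[:, k + 1] = np.maximum(
            X[:, k] + delta * dt + 2.0 * np.sqrt(X[:, k]) * dW, 0.0)
    return X
```

For delta >= 2 the continuous-time process never hits zero, so the clamp then only corrects discretization error rather than changing the dynamics.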
Secondly, we discuss the significance of these processes in the context of Stochastic Portfolio Theory. We describe specific market models which assume that the dynamics of the stocks' capitalizations are the same as those of the Generalized Volatility-Stabilized processes, and we argue that strong relative arbitrage opportunities may exist in these markets; specifically, we provide multiple examples of portfolios that outperform the market portfolio. Moreover, we examine the properties of the market weights as well as the diversity-weighted portfolio in these models.
Thirdly, we provide some asymptotic results for these processes which allow us to describe different properties of the corresponding market models.
Subjects: Statistics. Author ID: rp2424. Departments: Statistics, Mathematics. Type: Dissertations.

Re-use and Reproducibility: Opportunities and Challenges
http://academiccommons.columbia.edu/catalog/ac:162944
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:20964
Tue, 09 Jul 2013 09:37:23 +0000

To support the reliability and accuracy of the scientific record, science policy, research infrastructure, and the culture of science must facilitate the sharing of data and code resulting from scientific research, much of which is now produced using computational methods. Though the need to support the reproducibility of computational research is now widely recognized, copyright and other factors present challenges to the development of policies and practices.
Subjects: Technical communication, Information technology. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Variability of Universal Life Cash Flows under Higher Risk Investment Strategies
http://academiccommons.columbia.edu/catalog/ac:162700
Tayal, Abhishek; Yang, Canning; Dunn, Thomas P.
http://hdl.handle.net/10022/AC:P:20851
Thu, 27 Jun 2013 16:09:42 +0000

This integrated project studied the offsetting elements of higher nominal yields, greater credit loss expectations, and higher capital requirements on the profitability of a life insurer that pursues a higher yield investment strategy. Profitability measures were developed for a Universal Life product. The report provides an attribution of profit drivers for the insurer. The effects of credit rating migration on credit loss rates and bond capital charges were examined, and investment strategies were tested under credit stress scenarios.
Subjects: Finance. Author IDs: at2842, cy2315, tpd2111. Departments: Actuarial Sciences, Statistics. Type: Reports.

Credit Risk Modeling and Analysis Using Copula Method and Changepoint Approach to Survival Data
http://academiccommons.columbia.edu/catalog/ac:161682
Qian, Bo
http://hdl.handle.net/10022/AC:P:20510
Thu, 30 May 2013 16:36:22 +0000

This thesis consists of two parts. The first part uses the Gaussian copula and Student's t copula as the main tools to model credit risk in securitizations and re-securitizations. The second part proposes a statistical procedure to identify changepoints in the Cox model for survival data. The 2007-2009 financial crisis has been regarded by leading economists as the worst financial crisis since the Great Depression. The securitization sector took much of the blame because of the connection between securitized products created from mortgages and the collapse of the housing market. The first part of this thesis explores the relationship between securitized mortgage products and the 2007-2009 financial crisis using the copula method as the main tool. We show how the loss distributions of securitizations and re-securitizations can be derived or calculated in a new model. Simulations are conducted to examine the effectiveness of the model. As an application, the model is also used to examine whether, and where, the ratings of securitized products could be flawed. The lag effect and saturation effect are common and important problems in survival data analysis. They belong to a general class of problems in which the treatment effect takes occasional jumps instead of staying constant over time; they are therefore essentially changepoint problems in statistics. The second part of this thesis extends Lai and Xing's recent work on changepoint modeling, developed under a time series and Bayesian setup, to lag effect problems in survival data. A general changepoint approach for the Cox model is developed.
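As background to the copula machinery, a standard one-factor Gaussian copula (a textbook construction, not the thesis's specific model) draws correlated latent variables and marks a default whenever a latent variable falls below a fixed threshold:

```python
import numpy as np
from statistics import NormalDist

def copula_loss_distribution(n_names, rho, p_default, n_sims, seed=0):
    """Simulate portfolio default counts under a one-factor Gaussian copula.

    Latent X_i = sqrt(rho) * M + sqrt(1 - rho) * Z_i, with a common factor M
    shared by all names; name i defaults when X_i < Phi^{-1}(p_default), so
    each name's marginal default probability is exactly p_default.
    """
    rng = np.random.default_rng(seed)
    threshold = NormalDist().inv_cdf(p_default)
    M = rng.standard_normal((n_sims, 1))        # common (systematic) factor
    Z = rng.standard_normal((n_sims, n_names))  # idiosyncratic shocks
    X = np.sqrt(rho) * M + np.sqrt(1.0 - rho) * Z
    return (X < threshold).sum(axis=1)          # defaults per scenario
```

Raising `rho` leaves the expected default count unchanged but fattens the tail of the loss distribution, which is why correlation assumptions dominate the ratings of senior tranches.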
Simulations and real data analyses are conducted to illustrate the effectiveness of the procedure and how it should be implemented and interpreted.
Subjects: Statistics. Author ID: bq2102. Departments: Statistics. Type: Dissertations.

Why Public Access to Data is So Important (and why getting the policy right is even more so)
http://academiccommons.columbia.edu/catalog/ac:161424
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:20387
Tue, 21 May 2013 12:08:29 +0000

Open data is crucial to science today, as computation becomes central to scientific research, yet "Open Data" is not well-defined. The scope proposed here: share the data and code that permit others in the field to replicate published results, a role traditionally played by the publication alone.
Subjects: Information technology, Technical communication. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

On optimal arbitrage under constraints
http://academiccommons.columbia.edu/catalog/ac:160495
Sadhukhan, Subhankar
http://hdl.handle.net/10022/AC:P:20076
Wed, 01 May 2013 11:07:50 +0000

In this thesis, we investigate the existence of relative arbitrage opportunities in a Markovian model of a financial market consisting of a bond and stocks whose prices evolve like Itô processes. We consider markets where investors are constrained to choose from a restricted set of investment strategies. We show that the upper hedging price of a given contingent claim (i.e., the minimum amount of wealth needed to superreplicate it) in a constrained market can be expressed as the supremum of the fair price of the claim under certain unconstrained auxiliary Markovian markets. Under suitable assumptions, we further characterize the upper hedging price as a viscosity solution of certain variational inequalities. We then use this viscosity solution characterization to study how the imposition of stricter constraints on the market affects the upper hedging price. In particular, if relative arbitrage opportunities exist with respect to a given strategy, we study how stricter constraints can make such arbitrage opportunities disappear.
Subjects: Applied mathematics, Finance. Author ID: ss3240. Departments: Statistics, Mathematics. Type: Dissertations.

Statistical Inference for Diagnostic Classification Models
http://academiccommons.columbia.edu/catalog/ac:160464
Xu, Gongjun
http://hdl.handle.net/10022/AC:P:20058
Tue, 30 Apr 2013 16:06:11 +0000

Diagnostic classification models (DCMs) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he or she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and a substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study, and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers on the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, the problem can be formulated as a pure significance testing problem with null hypothesis H0: Q = Q0, where Q0 is the candidate Q-matrix.
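For context, the DINA model mentioned above links the Q-matrix to response probabilities through an ideal-response indicator plus slipping and guessing parameters. A sketch of this standard formulation (variable names are illustrative, not the thesis's notation):

```python
import numpy as np

def dina_correct_prob(alpha, Q, slip, guess):
    """P(correct response) under the DINA model.

    alpha: (N, K) 0/1 mastery profiles; Q: (J, K) 0/1 item-attribute Q-matrix;
    slip, guess: (J,) item parameters. The ideal response eta[i, j] is 1 iff
    subject i has mastered every attribute that item j requires.
    """
    # eta[i, j] = prod_k alpha[i, k] ** Q[j, k], computed via elementwise >=
    eta = np.all(alpha[:, None, :] >= Q[None, :, :], axis=2).astype(float)
    # Masters answer correctly unless they slip; non-masters only by guessing.
    return (1.0 - slip) ** eta * guess ** (1.0 - eta)
```

Misspecifying a single entry of Q changes which profiles get eta = 1, which is exactly why Q-matrix validation matters for model fit.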
We propose a test statistic that measures the consistency of the observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied, and we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.
Subjects: Statistics, Educational tests and measurements. Author ID: gx2108. Departments: Statistics. Type: Dissertations.

Testimony submitted to the House Committee on Science, Space and Technology for the March 5, 2013 hearing on Scientific Integrity and Transparency.
http://academiccommons.columbia.edu/catalog/ac:157889
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:19394
Thu, 21 Mar 2013 16:23:38 +0000

Reproducibility is a new challenge, brought about by advances in scientific research capability due to immense changes in technology over the last two decades. Widely recognized as a defining hallmark of science, it directly impacts the transparency and reliability of findings and is taken very seriously by the scientific community.
Subjects: Technical communication, Information technology. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Open Data, Open Methods, and the Promise of Large Scale Validation.
http://academiccommons.columbia.edu/catalog/ac:157883
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:19393
Thu, 21 Mar 2013 16:15:09 +0000

Reproducibility is core to science, and a critical issue in computational science.
Subjects: Technical communication, Information technology. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Digital Scholarship in Scientific Research: Open Questions in Reproducibility and Curation.
http://academiccommons.columbia.edu/catalog/ac:157879
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:19392
Thu, 21 Mar 2013 16:08:58 +0000

Computation presents only a potential third branch of the scientific method.
Subjects: Technical communication, Information technology. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Technology and the Scientific Method: The Credibility Crisis in Computational Science
http://academiccommons.columbia.edu/catalog/ac:157876
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:19391
Thu, 21 Mar 2013 16:00:08 +0000

Computation presents only a potential third branch of the scientific method.
Subjects: Technical communication, Information technology. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Facilitating Reproducibility: Open Data and Code in Economics
http://academiccommons.columbia.edu/catalog/ac:157873
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:19390
Thu, 21 Mar 2013 15:48:04 +0000

The aim of the workshop is to build an understanding of the value of open data and open tools for the Economics profession and the obstacles to opening up information, as well as the role of greater openness in broadening understanding of and engagement with Economics among the wider community, including policy-makers and society.
Subjects: Technical communication, Economics. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Bayesian Model Selection in terms of Kullback-Leibler discrepancy
http://academiccommons.columbia.edu/catalog/ac:158374
Zhou, Shouhao
http://hdl.handle.net/10022/AC:P:19157
Mon, 25 Feb 2013 13:36:40 +0000

In this article we develop practical model assessment and selection methods for Bayesian models, with the view that a promising approach should be objective enough to accept, easy enough to understand, general enough to apply, simple enough to compute, and coherent enough to interpret. We restrict attention mainly to the Kullback-Leibler divergence, a widely applied model evaluation measure that quantifies the similarity between a proposed candidate model and the underlying true model, where the "true model" refers only to the probability distribution that best projects the real but unknown dynamics of interest onto the statistical modeling space. In addition to reviewing the advantages and disadvantages of the prevailing practical model selection methods in the literature, we propose a series of convenient and useful tools, each designed for a different purpose, that asymptotically unbiasedly assess how well candidate Bayesian models predict a future independent observation. We also explore the connection of the Kullback-Leibler-based information criterion to Bayes factors, another popular Bayesian model comparison approach, motivated by the developments among the Bayes factor variants. In general, we expect to provide useful guidance for researchers interested in conducting Bayesian data analysis.
Subjects: Statistics. Author ID: sz2020. Departments: Statistics. Type: Dissertations.

Multiplicative Multiresolution Analysis for Lie-group Valued Data Indexed by a Euclidean Parameter
http://academiccommons.columbia.edu/catalog/ac:155756
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:15397
Wed, 12 Dec 2012 15:17:09 +0000

A presentation on Lie-group-valued data indexed by a Euclidean parameter. Such data might be: phase angles as functions of time or space, for example compass directions; 3D orientations of a rigid frame of reference as a function of time or space; or quaternions as a function of time or space. The approach can also be extended to quotients of Lie groups, which gives us the ability to model points on S2, the unit sphere, as functions of time or space.
Subjects: Computer science, Statistics. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

A Brief History of the Reproducibility Movement
http://academiccommons.columbia.edu/catalog/ac:155759
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:15396
Wed, 12 Dec 2012 14:51:16 +0000

Computational science cannot be elevated to a third branch of the scientific method until it generates routinely verifiable knowledge.
Subjects: Technical communication, Computer science. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Transparency in Computational Science
http://academiccommons.columbia.edu/catalog/ac:154852
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:15360
Tue, 27 Nov 2012 14:16:44 +0000

The central motivation for the scientific method is to root out error, yet computational science as practiced today does not generate reliable knowledge. This presentation looks at four possible solutions to the issues of transparency in computational science.
Subjects: Technical communication, Computer science. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Discussant: “Pornography and Divorce”
http://academiccommons.columbia.edu/catalog/ac:154713
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:15350
Wed, 21 Nov 2012 13:19:39 +0000

A presentation on data and design suggestions for research on the topic of pornography and divorce.
Subjects: Technical communication, Intellectual property. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

RunMyCode.org: a Novel Dissemination and Collaboration Platform for Executing Published Computational Results
http://academiccommons.columbia.edu/catalog/ac:154716
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:15349
Wed, 21 Nov 2012 13:15:47 +0000

A presentation on a collaboration platform for executing published computational results.
Subjects: Technical communication, Intellectual property. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Journal Policy and Reproducible Computational Research
http://academiccommons.columbia.edu/catalog/ac:154719
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:15348
Wed, 21 Nov 2012 13:05:57 +0000

Discusses policy possibilities for the issues of reproducibility and dissemination in computational science.
Subjects: Technical communication, Intellectual property. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Towards Reproducible Science: Policy and a Path Forward
http://academiccommons.columbia.edu/catalog/ac:154722
Stodden, Victoria C.
http://hdl.handle.net/10022/AC:P:15347
Wed, 21 Nov 2012 12:59:32 +0000

Discusses solutions and policy possibilities for the issues of reproducibility and dissemination in computational science.
Subjects: Technical communication, Intellectual property. Author ID: vcs2115. Departments: Statistics. Type: Presentations.

Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box
http://academiccommons.columbia.edu/catalog/ac:154731
Su, Yu-Sung; Yajima, Masanao; Gelman, Andrew E.; Hill, Jennifer
http://hdl.handle.net/10022/AC:P:15342
Tue, 20 Nov 2012 16:49:06 +0000

Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.
Subjects: Statistics. Author ID: ag389. Departments: Statistics, Political Science. Type: Articles.

R2WinBUGS: A Package for Running WinBUGS from R
http://academiccommons.columbia.edu/catalog/ac:154734
Sturtz, Sibylle; Ligges, Uwe; Gelman, Andrew E.
http://hdl.handle.net/10022/AC:P:15341
Tue, 20 Nov 2012 16:42:45 +0000

The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has finished, it is possible either to read the resulting data into R using the package itself, which gives a compact graphical summary of inference and convergence diagnostics, or to use the facilities of the coda package for further analyses of the output. Examples are given to demonstrate the usage of this package.
Subjects: Statistics. Author ID: ag389. Departments: Statistics, Political Science. Type: Articles.

Bayesian Statistical Pragmatism
http://academiccommons.columbia.edu/catalog/ac:154737
Gelman, Andrew E.
http://hdl.handle.net/10022/AC:P:15340
Tue, 20 Nov 2012 16:38:18 +0000

I agree with Rob Kass’ point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.
Subjects: Statistics. Author ID: ag389. Departments: Statistics, Political Science. Type: Articles.

Segregation in Social Networks Based on Acquaintanceship and Trust
http://academiccommons.columbia.edu/catalog/ac:154740
DiPrete, Thomas A.; Gelman, Andrew E.; McCormick, Tyler; Teitler, Julien O.; Zheng, Tian
http://hdl.handle.net/10022/AC:P:15339
Tue, 20 Nov 2012 16:17:57 +0000

Using 2006 General Social Survey data, the authors compare levels of segregation by race and along other dimensions of potential social cleavage in the contemporary United States. Americans are not as isolated as the most extreme recent estimates suggest. However, hopes that “bridging” social capital is more common in broader acquaintanceship networks than in core networks are not supported. Instead, the entire acquaintanceship network is perceived by Americans to be about as segregated as the much smaller network of close ties. People do not always know the religiosity, political ideology, family behaviors, or socioeconomic status of their acquaintances, but perceived social divisions on these dimensions are high, sometimes rivaling racial segregation in acquaintanceship networks. The major challenge to social integration today comes from the tendency of many Americans to isolate themselves from others who differ on race, political ideology, level of religiosity, and other salient aspects of social identity.
Subjects: Statistics. Author IDs: tad61, ag389, thm2105, jot8, tz33. Departments: Sociology, Statistics, Political Science, Social Work. Type: Articles.

Software Patents as a Barrier to Scientific Transparency: An Unexpected Consequence of Bayh-Dole
http://academiccommons.columbia.edu/catalog/ac:155777
Reich, Isabel Rose; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15328Tue, 20 Nov 2012 14:27:42 +0000Discusses solutions to the reproducibility and dissemination issues in computational science. Examines the interaction between the digitization of science and Intellectual Property Law, specifically the incentives created by the Bayh‐Dole Act to patent inventions associated with university‐based research.Technical communication, Intellectual propertyirr2105, vcs2115Applied Physics and Applied Mathematics, StatisticsPresentationsData-Intensive Science: Methods for Reproducibility and Dissemination
http://academiccommons.columbia.edu/catalog/ac:154952
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15327Tue, 20 Nov 2012 14:15:54 +0000Discusses solutions to the reproducibility and dissemination issues in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsThe Reproducible Research Movement: Crisis and Solutions
http://academiccommons.columbia.edu/catalog/ac:155774
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15326Tue, 20 Nov 2012 13:56:49 +0000Discusses solutions for the reproducibility of computational research.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsDisseminating Numerically Reproducible Research
http://academiccommons.columbia.edu/catalog/ac:154846
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:15325Tue, 20 Nov 2012 13:27:52 +0000Discusses solutions for reproducible computational research.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsMethods for studying the neural code in high dimensions
http://academiccommons.columbia.edu/catalog/ac:152510
Ramirez, Alexandro D.http://hdl.handle.net/10022/AC:P:14688Wed, 12 Sep 2012 16:25:41 +0000Over the last two decades technological developments in multi-electrode arrays and fluorescence microscopy have made it possible to simultaneously record from hundreds to thousands of neurons. Developing methods for analyzing these data in order to learn how networks of neurons respond to external stimuli and process information is an outstanding challenge for neuroscience. In this dissertation, I address the challenge of developing and testing models that are both flexible and computationally tractable when used with high dimensional data. In chapter 2 I will discuss an approximation to the generalized linear model (GLM) log-likelihood that I developed in collaboration with my thesis advisor. This approximation is designed to ease the computational burden of evaluating GLMs. I will show that our method reduces the computational cost of evaluating the GLM log-likelihood by a factor proportional to the number of parameters in the model times the number of observations. Therefore it is most beneficial in typical neuroscience applications where the number of parameters is large. I then detail a variety of applications where our method can be of use, including Maximum Likelihood estimation of GLM parameters, marginal likelihood calculations for model selection and Markov chain Monte Carlo methods for sampling from posterior parameter distributions. I go on to show that our model does not necessarily sacrifice accuracy for speed. Using both analytic calculations and multi-unit, primate retinal responses, I show that parameter estimates and predictions using our model can have the same accuracy as that of generalized linear models. In chapter 3 I study the neural decoding problem of predicting stimuli from neuronal responses. 
The focus is on reconstructing zebra finch song spectrograms, which are high-dimensional, by combining the spike trains of zebra finch auditory midbrain neurons with information about the correlations present in all zebra finch song. I use a GLM to model neuronal responses and a series of prior distributions, each carrying different amounts of statistical information about zebra finch song. For song reconstruction I make use of recent connections made between the applied mathematics literature on solving linear systems of equations involving matrices with special structure and neural decoding. This allowed me to calculate maximum a posteriori (MAP) estimates of song spectrograms in a time that only grows linearly, and is therefore quite tractable, with the number of time-bins in the song spectrogram. This speed was beneficial for answering questions which required the reconstruction of a variety of song spectrograms, each corresponding to a different prior on the distribution of zebra finch song. My collaborators and I found that spike trains from a population of MLd neurons combined with an uncorrelated Gaussian prior can estimate the amplitude envelope of song spectrograms. The same set of responses can be combined with Gaussian priors that have correlations matched to those found across multiple zebra finch songs to yield song spectrograms similar to those presented to the animal. The fidelity of spectrogram reconstructions from MLd responses relies more heavily on prior knowledge of spectral correlations than temporal correlations. However, the best reconstructions combine MLd responses with both spectral and temporal correlations.Neurosciencesadr2110Neurobiology and Behavior, Neuroscience, StatisticsDissertationsModeling Strategies for Large Dimensional Vector Autoregressions
http://academiccommons.columbia.edu/catalog/ac:152472
Zang, Pengfeihttp://hdl.handle.net/10022/AC:P:14666Tue, 11 Sep 2012 15:31:00 +0000The vector autoregressive (VAR) model has been widely used for describing the dynamic behavior of multivariate time series. However, fitting standard VAR models to large dimensional time series is challenging, primarily due to the large number of parameters involved. In this thesis, we propose two strategies for fitting large dimensional VAR models. The first strategy reduces the number of non-zero entries in the autoregressive (AR) coefficient matrices and the second reduces the effective dimension of the white noise covariance matrix. We propose a 2-stage approach for fitting large dimensional VAR models in which many of the AR coefficients are zero. The first stage provides an initial selection of non-zero AR coefficients by taking advantage of the properties of partial spectral coherence (PSC) in conjunction with BIC. The second stage, based on t-ratios and BIC, further prunes spurious non-zero AR coefficients retained after the first stage. Our simulation study suggests that the 2-stage approach outperforms Lasso-type methods in discovering sparsity patterns in AR coefficient matrices of VAR models. The performance of our 2-stage approach is also illustrated with three real data examples. Our second strategy for reducing the complexity of a large dimensional VAR model is based on a reduced-rank estimator for the white noise covariance matrix. We first derive the reduced-rank covariance estimator under the setting of independent observations and give the analytical form of its maximum likelihood estimate. Then we describe how to integrate the proposed reduced-rank estimator into the fitting of large dimensional VAR models, where we consider two scenarios that require different model fitting procedures. 
In the VAR modeling context, our reduced-rank covariance estimator not only provides interpretable descriptions of the dependence structure of VAR processes but also leads to improvement in model-fitting and forecasting over unrestricted covariance estimators. Two real data examples are presented to illustrate these fitting procedures.Statisticspz2146StatisticsDissertationsSome Models for Time Series of Counts
http://academiccommons.columbia.edu/catalog/ac:152149
Liu, Henghttp://hdl.handle.net/10022/AC:P:14561Wed, 29 Aug 2012 14:08:58 +0000This thesis focuses on developing nonlinear time series models and establishing relevant theory with a view towards applications in which the responses are integer valued. The discreteness of the observations, which classical time series models do not accommodate, requires novel modeling strategies. The majority of the existing models for time series of counts assume that the observations follow a Poisson distribution conditional on an accompanying intensity process that drives the serial dynamics of the model. According to whether the evolution of the intensity process depends on the observations or solely on an external process, the models are classified as parameter-driven or observation-driven. Compared to the former, an observation-driven model often allows for easier and more straightforward estimation of the model parameters. On the other hand, the stability properties of the process, such as the existence and uniqueness of a stationary and ergodic solution, which are required for deriving the asymptotic theory of the parameter estimates, can be quite complicated to establish compared to parameter-driven models. In this thesis, we first propose a broad class of observation-driven models that is based upon a one-parameter exponential family of distributions and incorporates nonlinear dynamics. The establishment of stability properties of these processes, which is at the heart of this thesis, is addressed by employing theory from iterated random functions and coupling techniques. Using this theory, we are also able to obtain the asymptotic behavior of maximum likelihood estimates of the parameters. Extensions of the base model in several directions are considered. Inspired by the idea of a self-excited threshold ARMA process, a threshold Poisson autoregression is proposed. 
It introduces a two-regime structure in the intensity process and essentially allows for modeling negatively correlated observations. An e-chain, a non-standard Markov chain technique, and Lyapunov's method are utilized to show stationarity and a law of large numbers for this process. In addition, the model has been adapted to incorporate covariates, a problem of primary practical interest. The base model is also extended to the case of multivariate time series of counts. Given a suitable definition of a multivariate Poisson distribution, a multivariate Poisson autoregression process is described and its properties studied. Several simulation studies are presented to illustrate the inference theory. The proposed models are also applied to several real data sets, including the number of transactions of the Ericsson stock, the return times of Goldman Sachs Group stock prices, the number of road crashes in Schiphol, the frequencies of occurrences of gold particles, the incidences of polio in the US and the number of presentations of asthma in an Australian hospital. An array of graphical and quantitative diagnostic tools, specifically designed for evaluating the goodness of fit of models for time series of counts, is described and illustrated with these data sets.Statisticshl2494StatisticsDissertationsStatistical inference in two non-standard regression problems
http://academiccommons.columbia.edu/catalog/ac:151460
Seijo, Emilio Franciscohttp://hdl.handle.net/10022/AC:P:14317Wed, 08 Aug 2012 13:43:26 +0000This thesis analyzes two regression models whose least squares estimators have nonstandard asymptotics. It is divided into an introduction and two parts. The introduction motivates the study of nonstandard problems and presents an outline of the contents of the remaining chapters. In part I, the least squares estimator of a multivariate convex regression function is studied in great detail. The main contribution here is a proof of the consistency of the aforementioned estimator in a completely nonparametric setting. Model misspecification, local rates of convergence and multidimensional regression models mixing convexity and componentwise monotonicity constraints will also be considered. Part II deals with change-point regression models and the issues that might arise when applying the bootstrap to these problems. The classical bootstrap is shown to be inconsistent on a simple change-point regression model, and an alternative (smoothed) bootstrap procedure is proposed and proved to be consistent. The superiority of the alternative method is also illustrated through a simulation study. In addition, a version of the continuous mapping theorem specially suited for change-point estimators is proved and used to derive the results concerning the bootstrap.Statistics, Applied mathematics, Mathematicsefs2113StatisticsDissertationsOpen Challenges to Open Science
http://academiccommons.columbia.edu/catalog/ac:147784
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13497Mon, 11 Jun 2012 12:04:48 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsTransparency in Scientific Discovery: Innovation and Knowledge Dissemination
http://academiccommons.columbia.edu/catalog/ac:147781
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13496Mon, 11 Jun 2012 12:00:56 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsFraming Science Policy: Reproducible Research, Not Open Data
http://academiccommons.columbia.edu/catalog/ac:147778
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13495Mon, 11 Jun 2012 11:56:52 +0000Discusses open data and open code as solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsReproducible Research in Computational Science: Strategies for Innovation
http://academiccommons.columbia.edu/catalog/ac:147775
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13494Mon, 11 Jun 2012 11:53:41 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsComments on "Measuring Racial Profiling"
http://academiccommons.columbia.edu/catalog/ac:147771
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13493Mon, 11 Jun 2012 11:39:12 +0000Discussion of a quantitative analysis of race and criminal justice.Criminologyvcs2115StatisticsPresentationsThe Credibility Crisis in Computational Science: A Call to Action
http://academiccommons.columbia.edu/catalog/ac:147766
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13492Mon, 11 Jun 2012 11:36:49 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsReproducible Research: A Digital Curation Agenda
http://academiccommons.columbia.edu/catalog/ac:147763
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13491Mon, 11 Jun 2012 11:28:14 +0000Discusses the necessity of open data and open code as a solution to the credibility crisis in computational science.Information science, Intellectual propertyvcs2115StatisticsPresentationsReproducibility in Computational Science
http://academiccommons.columbia.edu/catalog/ac:147760
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13490Mon, 11 Jun 2012 11:12:58 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsThe Credibility Crisis in Computational Science: An Information Issue
http://academiccommons.columbia.edu/catalog/ac:147757
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13489Mon, 11 Jun 2012 11:08:37 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsThe Reproducible Computational Science Movement: Tools, Policy, and Results
http://academiccommons.columbia.edu/catalog/ac:147754
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13488Mon, 11 Jun 2012 11:02:50 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsBuilding the Reproducible Computational Science Movement: Catalysing Action through Policy, Software Tools, and Ideas
http://academiccommons.columbia.edu/catalog/ac:147751
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13487Mon, 11 Jun 2012 10:58:44 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsIntellectual Property and Innovation in Computational Science: Dissemination of Ideas and Methodology
http://academiccommons.columbia.edu/catalog/ac:147748
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13486Mon, 11 Jun 2012 10:38:52 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsCopyright and MetaData in the World Heritage Digital Mathematical Library
http://academiccommons.columbia.edu/catalog/ac:147745
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13485Mon, 11 Jun 2012 10:16:28 +0000Intellectual property, Library sciencevcs2115StatisticsPresentationsScientists, Share Secrets or Lose Funding
http://academiccommons.columbia.edu/catalog/ac:147742
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13484Mon, 11 Jun 2012 09:40:44 +0000More and more published scientific studies are difficult or impossible to repeat. Federal agencies that fund scientific research are in a position to help fix this problem. They should require that all scientists whose studies they finance share the files that generated their published findings, the raw data and the computer instructions that carried out their analysis.Technical communication, Intellectual propertyvcs2115StatisticsArticlesThe Central Role of Geophysics in the Reproducible Research Movement
http://academiccommons.columbia.edu/catalog/ac:147729
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:13456Fri, 08 Jun 2012 09:54:48 +0000Discusses solutions to the credibility crisis in computational science.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsGenus Distributions of Graphs Constructed Through Amalgamations
http://academiccommons.columbia.edu/catalog/ac:146091
Poshni, Mehvish Irfanhttp://hdl.handle.net/10022/AC:P:12989Thu, 12 Apr 2012 12:46:03 +0000Graphs are commonly represented as points in space connected by lines. The points in space are the vertices of the graph, and the lines joining them are the edges of the graph. A general definition of a graph is considered here, where multiple edges are allowed between two vertices and an edge is permitted to connect a vertex to itself. It is assumed that graphs are connected, i.e., any vertex in the graph is reachable from another distinct vertex either directly through an edge connecting them or by a path consisting of intermediate vertices and connecting edges. Under this visual representation, graphs can be drawn on various surfaces. The focus of my research is restricted to a class of surfaces that are characterized as compact connected orientable 2-manifolds. The drawings of graphs on surfaces that are of primary interest follow certain prescribed rules. These are called 2-cellular graph embeddings, or simply embeddings. A well-known closed formula makes it easy to enumerate the total number of 2-cellular embeddings for a given graph over all surfaces. A much harder task is to give a surface-wise breakdown of this number as a sequence of numbers that count the number of 2-cellular embeddings of a graph for each orientable surface. This sequence of numbers for a graph is known as the genus distribution of a graph. Prior research on genus distributions of graphs has primarily focused on making calculations of genus distributions for specific families of graphs. These families of graphs have often been contrived, and the methods used for finding their genus distributions have not been general enough to extend to other graph families. The research I have undertaken aims at developing and using a general method that frames the problem of calculating genus distributions of large graphs in terms of a partitioning of the genus distributions of smaller graphs. 
To this end, I use various operations such as edge-amalgamation, self-edge-amalgamation, and vertex-amalgamation to construct large graphs out of smaller graphs, by coupling their vertices and edges together in certain consistent ways. This method assumes that the partitioned genus distribution of the smaller graphs is known or is easily calculable by computer, for instance, by using the famous Heffter-Edmonds algorithm. As an outcome of the techniques used, I obtain general recurrences and closed-formulas that give genus distributions for infinitely many recursively specifiable graph families. I also give an easily understood method for finding non-trivial examples of distinct graphs having the same genus distribution. In addition to this, I describe an algorithm that computes the genus distributions for a family of graphs known as the 4-regular outerplanar graphs.Computer sciencemp2452Computer Science, StatisticsDissertationsData Management and Federal Funding: What Researchers Need to Know
http://academiccommons.columbia.edu/catalog/ac:142524
Choudhury, Sayeed; Stodden, Victoria C.; Lehnert, Kerstin A.; Schlosser, Peterhttp://hdl.handle.net/10022/AC:P:11997Wed, 14 Dec 2011 11:22:22 +0000New requirements from the National Science Foundation and other federal agencies have brought data management and sharing into the spotlight. This trend will continue as more research sponsors, and the general public, demand increased access to federally-funded research data. This event examines the goals of these requirements and explores the technical, scientific, and professional challenges resulting from efforts to preserve and share data.Information sciencevcs2115, kal50, ps10Statistics, Lamont-Doherty Earth Observatory, Earth and Environmental Engineering, Earth Institute, Libraries and Information Services, Center for Digital Research and Scholarship, Scholarly Communication ProgramInterviews and roundtablesFinding a Maximum-Genus Graph Imbedding
http://academiccommons.columbia.edu/catalog/ac:142035
Furst, Merrick L.; Gross, Jonathan L.; McGeoch, Lyle A.http://hdl.handle.net/10022/AC:P:11837Mon, 28 Nov 2011 12:37:02 +0000The computational complexity of constructing the imbeddings of a given graph into surfaces of different genus is not well-understood. In this paper, topological methods and a reduction to linear matroid parity are used to develop a polynomial-time algorithm to find a maximum-genus cellular imbedding. This seems to be the first imbedding algorithm for which the running time is not exponential in the genus of the imbedding surface.Computer sciencejlg2Computer Science, StatisticsTechnical reportsAn Information-Theoretic Scale for Cultural Rule Systems
http://academiccommons.columbia.edu/catalog/ac:140503
Gross, Jonathan L.http://hdl.handle.net/10022/AC:P:11478Tue, 18 Oct 2011 11:57:04 +0000Important cultural messages are expressed in nonverbal media such as food, clothing, or the allocation of space or time. For instance, how and what a group of persons eats on a particular occasion may convey public information about that occasion and about the group of persons eating together. Whereas attention seems to be most commonly directed toward the individual character of the information, the present concern is the quantity of public information, as observed in the pattern of nonverbal cultural signs. To measure this quantity, it is proposed that the pattern of cultural signs be encoded as a sequence of abstract symbols (e.g. letters of the alphabet) and its complexity appraised by a suitably adapted form of the measure of Kolmogorov and Chaitin. That is, an algorithmic language is constructed and the mathematical information quantity is reckoned as the length of the shortest program that yields the sequence. In this cultural context, the measure is called "intricacy". By focusing on syntactic structure and pattern variation rather than on background levels, intricacy resists some influences of material wealth that tend to distort comparisons of individuals and groups. A compact mathematical overview of the theory is presented and an experiment to test it within the social medium of food sharing is briefly described.Information science, Sociology, Applied mathematicsjlg2Computer Science, StatisticsTechnical reportsMultiscale Representations for Manifold-Valued Data
http://academiccommons.columbia.edu/catalog/ac:140178
Rahman, Inam Ur; Drori, Iddo; Stodden, Victoria C.; Donoho, David L.; Schroeder, Peterhttp://hdl.handle.net/10022/AC:P:11434Tue, 11 Oct 2011 15:45:58 +0000We describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as the sphere S^2, the special orthogonal group SO(3), the positive definite matrices SPD(n), and the Grassmann manifolds G(n, k). The representations are based on the deployment of Deslauriers-Dubuc and Average Interpolating pyramids "in the tangent plane" of such manifolds, using the Exp and Log maps of those manifolds. The representations provide "wavelet coefficients" which can be thresholded, quantized, and scaled much as traditional wavelet coefficients. Tasks such as compression, noise removal, contrast enhancement, and stochastic simulation are facilitated by this representation. The approach applies to general manifolds, but is particularly suited to the manifolds we consider, i.e. Riemannian symmetric spaces such as S^(n-1), SO(n), G(n, k), where the Exp and Log maps are effectively computable. Applications to manifold-valued data sources of a geometric nature (motion, orientation, diffusion) seem particularly immediate. A software toolbox, SymmLab, can reproduce the results discussed in this paper.Statisticsvcs2115StatisticsArticlesWhen Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
http://academiccommons.columbia.edu/catalog/ac:140175
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11433Tue, 11 Oct 2011 15:32:23 +0000We interpret non-negative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of "parts" and NMF correctly identifies the "parts". We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases.Statisticsvcs2115StatisticsArticlesFast l1 Minimization for Genomewide Analysis of mRNA Lengths
http://academiccommons.columbia.edu/catalog/ac:140172
Drori, Iddo; Stodden, Victoria C.; Hurowitz, Evan H.Tue, 11 Oct 2011 15:19:48 +0000Application of the virtual northern method to human mRNA allows us to systematically measure transcript length on a genome-wide scale [1]. Characterization of RNA transcripts by length provides a measurement which complements cDNA sequencing. We have robustly extracted the lengths of the transcripts expressed by each gene for comparison with the Unigene, Refseq, and H-Invitational databases [2, 3]. Obtaining an accurate probability for each peak requires performing multiple bootstrap simulations, each involving a deconvolution operation which is equivalent to finding the sparsest non-negative solution of an underdetermined system of equations. This process is computationally intensive for a large number of simulations and genes. In this contribution we present an efficient approximation method which is faster than general purpose solvers by two orders of magnitude, and in practice reduces our processing time from a week to hours.Genetics, Statisticsvcs2115StatisticsArticlesBreakdown Point of Model Selection When the Number of Variables Exceeds the Number of Observations
http://academiccommons.columbia.edu/catalog/ac:140168
Donoho, David L.; Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11431Tue, 11 Oct 2011 15:07:17 +0000The classical multivariate linear regression problem assumes p variables X1, X2, ..., Xp and a response vector y, each with n observations, and a linear relationship between the two: y = X beta + z, where z ~ N(0, sigma^2). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p >> n. We find that 1) the breakdown point is well-defined for random X-models and low noise, 2) increasing noise shifts the breakdown point to lower levels of sparsity, and reduces the model recovery ability of the algorithm in a systematic way, and 3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.Statisticsvcs2115StatisticsArticlesSparseLab Architecture
http://academiccommons.columbia.edu/catalog/ac:140164
Donoho, David L.; Stodden, Victoria C.; Tsaig, Yaakovhttp://hdl.handle.net/10022/AC:P:11430Tue, 11 Oct 2011 14:54:27 +0000Changes and Enhancements for Release 2.0: 4 papers have been added to SparseLab 2.0: "Fast Solution of l1-norm Minimization Problems When the Solutions May be Sparse"; "Why Simple Shrinkage is Still Relevant For Redundant Representations"; "Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise"; "On the Stability of Basis Pursuit in the Presence of Noise." This document describes the architecture of SparseLab version 2.0. It is designed for users who already have had day-to-day interaction with the package and now need specific details about the architecture of the package, for example to modify components for their own research.Technical communication, Computer sciencevcs2115StatisticsTechnical reportsAbout SparseLab
http://academiccommons.columbia.edu/catalog/ac:140160
Donoho, David L.; Stodden, Victoria C.; Tsaig, Yaakovhttp://hdl.handle.net/10022/AC:P:11429Tue, 11 Oct 2011 14:42:12 +0000Changes and Enhancements for Release 2.0: 4 papers have been added to SparseLab 2.0: "Fast Solution of l1-norm Minimization Problems When the Solutions May be Sparse"; "Why Simple Shrinkage is Still Relevant For Redundant Representations"; "Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise"; "On the Stability of Basis Pursuit in the Presence of Noise." SparseLab is a library of Matlab routines for finding sparse solutions to underdetermined systems. The library is available free of charge over the Internet. Versions are provided for Macintosh, UNIX and Windows machines. Downloading and installation instructions are given here. SparseLab has over 400 .m files which are documented, indexed and cross-referenced in various ways. In this document we suggest several ways to get started using SparseLab: (a) trying out the pedagogical examples, (b) running the demonstrations, which illustrate the use of SparseLab in published papers, and (c) browsing the extensive collection of source files, which are self-documenting. SparseLab makes available, in one package, all the code to reproduce all the figures in the included published articles. The interested reader can inspect the source code to see exactly what algorithms were used, and how parameters were set in producing our figures, and can then modify the source to produce variations on our results. SparseLab has been developed, in part, because of exhortations by Jon Claerbout of Stanford that computational scientists should engage in "really reproducible" research. This document helps with installation and getting started, as well as describing the philosophy, limitations and rules of the road for this software.Technical communication, Computer sciencevcs2115StatisticsTechnical reportsVirtual Northern Analysis of the Human Genome
http://academiccommons.columbia.edu/catalog/ac:140156
Hurowitz, Evan H.; Drori, Iddo; Stodden, Victoria C.; Brown, Patrick O.; Donoho, David L.http://hdl.handle.net/10022/AC:P:11428Tue, 11 Oct 2011 14:27:15 +0000We applied the Virtual Northern technique to human brain mRNA to systematically measure human mRNA transcript lengths on a genome-wide scale. We used separation by gel electrophoresis followed by hybridization to cDNA microarrays to measure 8,774 mRNA transcript lengths representing at least 6,238 genes at high (>90%) confidence. By comparing these transcript lengths to the Refseq and H-Invitational full-length cDNA databases, we found that nearly half of our measurements appeared to represent novel transcript variants. Comparison of length measurements determined by hybridization to different cDNAs derived from the same gene identified clones that potentially correspond to alternative transcript variants. We observed a close linear relationship between ORF and mRNA lengths in human mRNAs, identical in form to the relationship we had previously identified in yeast. Some functional classes of protein are encoded by mRNAs whose untranslated regions (UTRs) tend to be longer or shorter than average; these functional classes were similar in both human and yeast. Human transcript diversity is extensive and largely unannotated. Our length dataset can be used as a new criterion for judging the completeness of cDNAs and annotating mRNA sequences. Similar relationships between the lengths of the UTRs in human and yeast mRNAs and the functions of the proteins they encode suggest that UTR sequences serve an important regulatory role among eukaryotes.Genetics, Molecular biologyvcs2115StatisticsArticlesThe Legal Framework for Reproducible Scientific Research: Licensing and Copyright
http://academiccommons.columbia.edu/catalog/ac:140153
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11427Tue, 11 Oct 2011 13:53:13 +0000As computational researchers increasingly make their results available in a reproducible way, and often outside the traditional journal publishing mechanism, questions naturally arise with regard to copyright, subsequent use and citation, and ownership rights in general. The growing number of scientists who release their research publicly face a gap in the current licensing and copyright structure, particularly on the Internet. Scientific research produces more than the final paper: the code, data structures, experimental design and parameters, documentation, and figures are all important for communicating scholarship and replicating results. The author proposes the Reproducible Research Standard for scientific researchers to apply to all components of their scholarship; it should encourage reproducible scientific investigation through attribution, facilitate greater collaboration, and promote engagement of the larger community in scientific learning and discovery.Technical communication, Intellectual propertyvcs2115StatisticsArticlesReproducible Research in Computational Harmonic Analysis
http://academiccommons.columbia.edu/catalog/ac:140150
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11426Tue, 11 Oct 2011 13:47:29 +0000Scientific computation is emerging as absolutely central to the scientific method. Unfortunately, it's error-prone and currently immature—traditional scientific publication is incapable of finding and rooting out errors in scientific computation—which must be recognized as a crisis. An important recent development and a necessary response to the crisis is reproducible computational research in which researchers publish the article along with the full computational environment that produces the results. The authors have practiced reproducible computational research for 15 years and have integrated it with their scientific research and with doctoral and postdoctoral education. In this article, they review their approach and how it has evolved over time, discussing the arguments for and against working reproducibly.Technical communication, Information sciencevcs2115StatisticsArticlesEnabling Reproducible Research: Open Licensing for Scientific Innovation
http://academiccommons.columbia.edu/catalog/ac:140147
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11425Tue, 11 Oct 2011 13:17:47 +0000There is a gap in the current licensing and copyright structure for the growing number of scientists releasing their research publicly, particularly on the Internet. Scientific research produces more scholarship than the final paper: for example, the code, data structures, experimental design and parameters, documentation, and figures, are all important both for communication of the scholarship and replication of the results. US copyright law is a barrier to the sharing of scientific scholarship since it establishes exclusive rights for creators over their work, thereby limiting the ability of others to copy, use, build upon, or alter the research. This is precisely opposite to prevailing scientific norms, which provide both that results be replicated before accepted as knowledge, and that scientific understanding be built upon previous discoveries for which authorship recognition is given. In accordance with these norms and to encourage the release of all scientific scholarship, I propose the Reproducible Research Standard (RRS) both to ensure attribution and facilitate the sharing of scientific works. Using the RRS on all components of scientific scholarship will encourage reproducible scientific investigation, facilitate greater collaboration, and promote engagement of the larger community in scientific learning and discovery.Technical communication, Intellectual propertyvcs2115StatisticsArticlesA Global Empirical Evaluation of New Communication Technology Use and Democratic Tendency
http://academiccommons.columbia.edu/catalog/ac:140144
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11424Tue, 11 Oct 2011 12:25:25 +0000Is the dramatic increase in Internet use associated with a commensurate rise in democracy? Few previous studies have drawn on multiple perception-based measures of governance to assess the Internet's effects on the process of democratization. This paper uses perception-based time series data on "Voice & Accountability," "Political Stability," and "Rule of Law" to provide insights into democratic tendency. The results of regression analysis suggest that the level of "Voice & Accountability" in a country increases with Internet use, while the level of "Political Stability" decreases with increasing Internet use. Additionally, Internet use was found to increase significantly for countries with increasing levels of "Voice & Accountability." In contrast, "Rule of Law" was not significantly affected by a country's level of Internet use. Increasing cell phone use did not seem to affect either "Voice & Accountability", "Political Stability" or "Rule of Law." In turn, cell phone use was not affected by any of these three measures of democratic tendency. When limiting our analysis to autocratic regimes, we noted a significant negative effect of Internet and cell phone use on "Political Stability" and found that the "Rule of Law" and "Political Stability" metrics drove ICT adoption.Web studies, Political sciencevcs2115StatisticsArticlesOpen science: policy implications for the evolving phenomenon of user-led scientific innovation
http://academiccommons.columbia.edu/catalog/ac:140127
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11419Mon, 10 Oct 2011 16:21:09 +0000From contributions of astronomy data and DNA sequences to disease treatment research, scientific activity by non-scientists is a real and emergent phenomenon, and one that raises policy questions. This involvement in science can be understood as an issue of access to publications, code, and data that facilitates public engagement in the research process; thus, appropriate policy to support the associated welfare-enhancing benefits is essential. Current legal barriers to citizen participation can be alleviated by scientists' use of the "Reproducible Research Standard," thus making the literature, data, and code associated with scientific results accessible. The enterprise of science is undergoing deep and fundamental changes, particularly in how scientists obtain results and share their work: the promise of open research dissemination held by the Internet is gradually being fulfilled by scientists. Contributions to science from beyond the ivory tower are forcing a rethinking of traditional models of knowledge generation, evaluation, and communication. The notion of a scientific "peer" is blurred with the advent of lay contributions to science, raising questions regarding the concepts of peer-review and recognition. New collaborative models are emerging around both open scientific software and the generation of scientific discoveries that bear a similarity to open innovation models in other settings. Public engagement in science can be understood as an issue of access to knowledge for public involvement in the research process, facilitated by appropriate policy to support the welfare-enhancing benefits deriving from citizen science.Technical communication, Information sciencevcs2115StatisticsArticlesReproducible Research: Addressing the Need for Data and Code Sharing in Computational Science
http://academiccommons.columbia.edu/catalog/ac:140124
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11418Mon, 10 Oct 2011 16:05:57 +0000Roundtable participants identified ways of making computational research details readily available, which is a crucial step in addressing the current credibility crisis.Technical communication, Information sciencevcs2115StatisticsArticlesThe Scientific Method in Practice: Reproducibility in the Computational Sciences
http://academiccommons.columbia.edu/catalog/ac:140117
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11417Mon, 10 Oct 2011 15:53:35 +0000Since the 1660s, the scientific method has included reproducibility as a mainstay in its effort to root out error from scientific discovery. With the explosive growth of digitization in scientific research and communication, it is easier than ever to satisfy this requirement. In computational research, experimental details and methods can be recorded in code and scripts, data is digital, papers are frequently online, and the result is the potential for "really reproducible research." Imagine the ability to routinely inspect code and data and recreate others' results: Every step taken to achieve the findings can potentially be transparent. Now imagine anyone with an Internet connection and the capability of running the code being able to do this. This paper investigates the obstacles blocking the sharing of code and data to understand conditions under which computational scientists reveal their full research compendium. A survey of registrants at a top machine learning conference (NIPS) was used to discover the strength of underlying factors that affect the decision to reveal code, data, and ideas. Sharing of code and data is becoming more common as about a third of respondents post some on their websites, and about 85% self-report having some code or data publicly available on the web. Contrary to theoretical expectations, the decision to share work is grounded in communitarian norms, although when work remains hidden private incentives dominate the decision. We find that code, data, and ideas are each regarded differently in terms of how they are revealed and that guidance from scientific norms varies with pervasiveness of computation in the field. The largest barriers to sharing are the time involved in preparation of work and the legal Intellectual Property framework scientists face. This paper does two things. 
First, it provides evidence in the debate about whether scientists' research-revealing behavior is wholly governed by considerations of personal impact or whether the reasoning behind the decision to reveal involves larger scientific ideals; second, it describes the actual sharing behavior in the Machine Learning community.Technical communication, Computer sciencevcs2115StatisticsWorking papersRemarks presented before the National Academies Committee on the Impact of Copyright Policy on Innovation in the Digital Era
http://academiccommons.columbia.edu/catalog/ac:140113
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11416Mon, 10 Oct 2011 15:42:23 +0000Thank you for the opportunity to address this committee at the National Academy of Science. You are uniquely positioned to contend with the barriers to innovation that arise through the impact of copyright law on scientific integrity. In my remarks I hope to convince you of the urgent need for the Committee to redress these barriers directly by recommending open licensing for scientific works, in particular code and data. Copyright law works counter to scientific progress, with enormous impact on innovation both inside and outside the scientific enterprise.Technical communication, Intellectual propertyvcs2115StatisticsPresentationsCyber Science and Engineering: A Report of the National Science Foundation Advisory Committee for Cyberinfrastructure Task Force on Grand Challenges
http://academiccommons.columbia.edu/catalog/ac:140109
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11415Mon, 10 Oct 2011 15:08:10 +0000This document contains the findings and recommendations of the NSF – Advisory Committee for Cyberinfrastructure Task Force on Grand Challenges addressed by advances in Cyber Science and Engineering. The term Cyber Science and Engineering (CS&E) is introduced to describe the intellectual discipline that brings together core areas of science and engineering, computer science, and computational and applied mathematics in a concerted effort to use the cyberinfrastructure (CI) for scientific discovery and engineering innovations; CS&E is computational and data-based science and engineering enabled by CI. The report examines a host of broad issues faced in addressing the Grand Challenges of science and technology and explores how those can be met by advances in CI. Included in the report are recommendations for new programs and initiatives that will expand the portfolio of the Office of Cyberinfrastructure and that will be critical to advances in all areas of science and engineering that rely on the CI.Technical communication, Information sciencevcs2115StatisticsReportsIntellectual Contributions to Digitized Science: Implementing the Scientific Method
http://academiccommons.columbia.edu/catalog/ac:139738
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11401Thu, 06 Oct 2011 12:08:53 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsBasics of Intellectual Property for Computational Scientists
http://academiccommons.columbia.edu/catalog/ac:139735
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11400Thu, 06 Oct 2011 12:04:21 +0000Presented at the Applied Mathematics Perspectives workshop, "Reproducible Research: Tools and Strategies for Scientific Computing," Vancouver, B.C., July 13-16, 2011.Technical communication, Information sciencevcs2115StatisticsPresentationsFunding Agency Policy and the Credibility Crisis in Computational Science
http://academiccommons.columbia.edu/catalog/ac:139732
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11399Thu, 06 Oct 2011 11:56:41 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsWhat is Reproducible Research? The Practice of Science Today and the Scientific Method
http://academiccommons.columbia.edu/catalog/ac:139728
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11398Thu, 06 Oct 2011 11:48:22 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsReproducibility in Computational Science: Framing the Concept
http://academiccommons.columbia.edu/catalog/ac:139725
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11397Thu, 06 Oct 2011 11:43:02 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsIntellectual Property and Computational Science
http://academiccommons.columbia.edu/catalog/ac:139722
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11396Thu, 06 Oct 2011 11:38:10 +0000Technical communication, Information science, Intellectual propertyvcs2115StatisticsPresentationsPolicies for Scientific Integrity and Reproducibility: Data and Code Sharing
http://academiccommons.columbia.edu/catalog/ac:139719
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11395Thu, 06 Oct 2011 11:34:29 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsFacilitating Scientific Discovery in the Digital Age
http://academiccommons.columbia.edu/catalog/ac:139716
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11394Thu, 06 Oct 2011 11:28:27 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsScientific Reproducibility: First Steps and Guiding Questions
http://academiccommons.columbia.edu/catalog/ac:139713
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11393Thu, 06 Oct 2011 11:20:43 +0000Technical communication, Information science, Intellectual propertyvcs2115StatisticsPresentationsTechnology and the Scientific Method: Tools and Policies for Addressing the Credibility Crisis in Computational Science
http://academiccommons.columbia.edu/catalog/ac:139710
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11392Thu, 06 Oct 2011 11:17:45 +0000Technical communication, Information science, Intellectual propertyvcs2115StatisticsPresentationsTools for Academic Research: Resolving the Credibility Crisis in Computational Science
http://academiccommons.columbia.edu/catalog/ac:139707
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11391Thu, 06 Oct 2011 11:12:12 +0000Technical communication, Information science, Intellectual propertyvcs2115StatisticsPresentationsHow Technology Is (Rapidly) Expanding the Scope of the Law in Statistics
http://academiccommons.columbia.edu/catalog/ac:139704
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11390Thu, 06 Oct 2011 11:02:07 +0000Technical communication, Information science, Intellectual propertyvcs2115StatisticsPresentationsScientific Practice Today and the Scientific Method: Responding to the Credibility Crisis
http://academiccommons.columbia.edu/catalog/ac:139701
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11389Thu, 06 Oct 2011 10:38:16 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsEstablishing Scientific Facts
http://academiccommons.columbia.edu/catalog/ac:141660
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11388Thu, 06 Oct 2011 10:30:56 +0000Presented at the FQXi conference, "Setting Time Aright," Copenhagen, August 27-September 1, 2011.Technical communication, Information sciencevcs2115StatisticsPresentationsThe Credibility Crisis and Computational Science: Accountability and Public Health
http://academiccommons.columbia.edu/catalog/ac:139692
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11387Thu, 06 Oct 2011 10:20:17 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsData Management and Sharing Policies in the NSF and the NIH
http://academiccommons.columbia.edu/catalog/ac:139689
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11386Thu, 06 Oct 2011 10:06:36 +0000Technical communication, Information sciencevcs2115StatisticsPresentationsData Sharing in Social Science Repositories: Facilitating Reproducible Computational Research
http://academiccommons.columbia.edu/catalog/ac:139591
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11346Mon, 03 Oct 2011 16:53:23 +0000From new types of data to new computational methodologies, computation is engendering a revolution in social science research, and with this comes the issue of facilitating data and code sharing to encourage collaboration and reproducibility in scientific publishing. A repository designed for this purpose at Harvard University, the Dataverse Network, permits authors to upload data and code with their own terms of use. This paper examines these terms of use for 30,090 uploads to discover barrier issues to sharing in the social sciences and compares them to those found in a survey of NIPS registrants. We find that the additionally specified terms of use in The Dataverse Network primarily address issues of maintaining subject confidentiality, preventing further sharing, making specific citation a condition of use, restricting access by commercial or profit-making entities, and time embargoes, which differ from those elucidated among NIPS participants. Using these findings we suggest a sharing framework for social science data to expand engagement of the larger social science community and encourage verification of research findings.Technical communication, Information sciencevcs2115StatisticsArticlesWhite Paper for Expert Panel Discussion on Data Policies: A Workshop of the National Science Board, March 27-29, 2011
http://academiccommons.columbia.edu/catalog/ac:139588
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11345Mon, 03 Oct 2011 16:26:37 +0000In our workshop charge we were invited to read three reports that formed the basis for the NSB‐approved Data Policies Task Force's "Statement of Principles," providing the starting point for this workshop. I take a contrarian perspective and challenge the assumption in all these documents that open data is a foundational component of the scientific endeavor. Instead, I argue that the framing principle should be the reproducibility of computational results, from which open data (along with open code) falls as a natural corollary. In this note I highlight six implications of the framing of reproducible research as a guiding principle for science policy in the digital age.Technical communication, Information sciencevcs2115StatisticsArticlesInnovation and Growth through Open Access to Scientific Research: Three Ideas for High-Impact Rule Changes
http://academiccommons.columbia.edu/catalog/ac:139585
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11344Mon, 03 Oct 2011 16:09:15 +0000Technical communication, Intellectual propertyvcs2115StatisticsArticlesTrust Your Science? Open Your Data and Code
http://academiccommons.columbia.edu/catalog/ac:139369
Stodden, Victoria C.http://hdl.handle.net/10022/AC:P:11312Thu, 29 Sep 2011 14:25:26 +0000Information sciencevcs2115StatisticsArticlesHigh dimensional information processing
http://academiccommons.columbia.edu/catalog/ac:139263
Rahnama Rad, Kamiarhttp://hdl.handle.net/10022/AC:P:11287Wed, 28 Sep 2011 12:49:36 +0000Part I: Consider the n-dimensional vector y = Xβ + ε, where β ∈ R^p has only k nonzero entries and ε ∈ R^n is Gaussian noise. This can be viewed as a linear system with sparsity constraints corrupted by noise, where the objective is to estimate the sparsity pattern of β given the observation vector y and the measurement matrix X. First, we derive a non-asymptotic upper bound on the probability that a specific wrong sparsity pattern is identified by the maximum-likelihood estimator. We find that this probability depends (inversely) exponentially on the difference between ‖Xβ‖2 and the ℓ2-norm of Xβ projected onto the range of columns of X indexed by the wrong sparsity pattern. Second, when X is randomly drawn from a Gaussian ensemble, we calculate a non-asymptotic upper bound on the probability of the maximum-likelihood decoder not declaring (partially) the true sparsity pattern. Consequently, we obtain sufficient conditions on the sample size n that guarantee almost surely the recovery of the true sparsity pattern. We find that the required growth rate of sample size n matches the growth rate of previously established necessary conditions. Part II: Estimating two-dimensional firing rate maps is a common problem, arising in a number of contexts: the estimation of place fields in hippocampus, the analysis of temporally nonstationary tuning curves in sensory and motor areas, the estimation of firing rates following spike-triggered covariance analyses, etc. Here we introduce methods based on Gaussian process nonparametric Bayesian techniques for estimating these two-dimensional rate maps. 
These techniques offer a number of advantages: the estimates may be computed efficiently, come equipped with natural error bars, adapt their smoothness automatically to the local density and informativeness of the observed data, and permit direct fitting of the model hyperparameters (e.g., the prior smoothness of the rate map) via maximum marginal likelihood. We illustrate the flexibility and performance of the new techniques on a variety of simulated and real data. Part III: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design. 
Part IV: A model of distributed parameter estimation in networks is introduced, where agents have access to partially informative measurements over time. Each agent faces a local identification problem, in the sense that it cannot consistently estimate the parameter in isolation. We prove that, despite local identification problems, if agents update their estimates recursively as a function of their neighbors’ beliefs, they can consistently estimate the true parameter provided that the communication network is strongly connected; that is, there exists an information path between any two agents in the network. We also show that the estimates of all agents are asymptotically normally distributed. Finally, we compute the asymptotic variance of the agents’ estimates in terms of their observation models and the network topology, and provide conditions under which the distributed estimators are as efficient as any centralized estimator.Neurosciences, Applied mathematicskr2248StatisticsDissertationsSome Problems in Topographical Graph Theory
http://academiccommons.columbia.edu/catalog/ac:138034
Gross, Jonathan L.; Harary, Frankhttp://hdl.handle.net/10022/AC:P:11055Wed, 31 Aug 2011 14:20:08 +0000Computer sciencejlg2Computer Science, StatisticsTechnical reportsSelf-controlled methods for postmarketing drug safety surveillance in large-scale longitudinal data
http://academiccommons.columbia.edu/catalog/ac:137551
Simpson, Shawn E.http://hdl.handle.net/10022/AC:P:10963Mon, 22 Aug 2011 13:01:55 +0000A primary objective in postmarketing drug safety surveillance is to ascertain the relationship between time-varying drug exposures and adverse events (AEs) related to health outcomes. Surveillance can be based on longitudinal observational databases (LODs), which contain time-stamped patient-level medical information including periods of drug exposure and dates of diagnoses. Due to its desirable properties, we focus on the self-controlled case series (SCCS) method for analysis in this context. SCCS implicitly controls for fixed multiplicative baseline covariates since each individual acts as their own control. In addition, only exposed cases are required for the analysis, which is computationally advantageous. In the first part of this work we present how the simple SCCS model can be applied to the surveillance problem, and compare the results of simple SCCS to those of existing methods. Many current surveillance methods are based on marginal associations between drug exposures and AEs. Such analyses ignore confounding drugs and interactions and have the potential to give misleading results. In order to avoid these difficulties, it is desirable for an analysis strategy to incorporate large numbers of time-varying potential confounders such as other drugs. In the second part of this work we propose the Bayesian multiple SCCS approach, which deals with high dimensionality and can provide a sparse solution via a Laplacian prior. We present details of the model and optimization procedure, as well as results of empirical investigations. SCCS is based on a conditional Poisson regression model, which assumes that events at different time points are conditionally independent given the covariate process. This requirement is problematic when the occurrence of an event can alter the future event risk. 
In a clinical setting, for example, patients who have a first myocardial infarction (MI) may be at higher subsequent risk for a second. In the third part of this work we propose the positive dependence self-controlled case series (PD-SCCS) method: a generalization of SCCS that allows the occurrence of an event to increase the future event risk, yet maintains the advantages of the original by controlling for fixed baseline covariates and relying solely on data from cases. We develop the model and compare the results of PD-SCCS and SCCS on example drug-AE pairs.Statisticsses2155StatisticsDissertationsEstimation of System Reliability Using a Semiparametric Model
http://academiccommons.columbia.edu/catalog/ac:135421
Wu, Leon Li; Teravainen, Timothy Kaleva; Kaiser, Gail E.; Anderson, Roger N.; Boulanger, Albert G.; Rudin, Cynthiahttp://hdl.handle.net/10022/AC:P:10670Fri, 08 Jul 2011 12:08:15 +0000An important problem in reliability engineering is to predict the failure rate, that is, the frequency with which an engineered system or component fails. This paper presents a new method of estimating failure rate using a semiparametric model with Gaussian process smoothing. The method is able to provide accurate estimation based on historical data, and it does not make strong a priori assumptions about the failure rate pattern (e.g., constant or monotonic). Our experiments applying this method to power system failure data, with comparisons against other models, show its efficacy and accuracy. This method can be used in estimating reliability for many other systems, such as software systems or components.Computer sciencellw2107, tkt2103, gek1, rna1Computer Science, Center for Computational Learning Systems, StatisticsTechnical reportsComparing Speed of Provider Data Entry: Electronic Versus Paper Methods
http://academiccommons.columbia.edu/catalog/ac:133547
Jackson, Kevin M.; Kaiser, Gail E.; Wong, Lyndon; Rabinowitz, Daniel; Chiang, Michael F.http://hdl.handle.net/10022/AC:P:10508Wed, 08 Jun 2011 14:25:25 +0000Electronic health record (EHR) systems have significant potential advantages over traditional paper-based systems, but they require that providers assume responsibility for data entry. One significant barrier to adoption of EHRs is the perception of slowed data entry by providers. This study compares the speed of data entry using computer-based templates vs. paper for a large eye clinic, using 10 subjects and 10 simulated clinical scenarios. Data entry into the EHR was significantly slower (p<0.01) than with traditional paper forms.Computer sciencegek1, dr105Computer Science, Statistics, Ophthalmology, Biomedical InformaticsTechnical reportsOptimal Trading Strategies Under Arbitrage
http://academiccommons.columbia.edu/catalog/ac:131477
Ruf, Johannes Karl Dominikhttp://hdl.handle.net/10022/AC:P:10250Fri, 29 Apr 2011 18:21:03 +0000This thesis analyzes models of financial markets that incorporate the possibility of arbitrage opportunities. The first part demonstrates how explicit formulas for optimal trading strategies in terms of minimal required initial capital can be derived in order to replicate a given terminal wealth in a continuous-time Markovian context. Towards this end, only the existence of a square-integrable market price of risk (rather than the existence of an equivalent local martingale measure) is assumed. A new measure under which the dynamics of the stock price processes simplify is constructed. It is shown that delta hedging does not depend on the "no free lunch with vanishing risk" assumption. However, in the presence of arbitrage opportunities, finding an optimal strategy is directly linked to the non-uniqueness of the partial differential equation corresponding to the Black-Scholes equation. In order to apply these analytic tools, sufficient conditions are derived for the necessary differentiability of expectations indexed over the initial market configuration. The phenomenon of "bubbles," which has been a popular topic in the recent academic literature, appears as a special case of the setting in the first part of this thesis. Several examples at the end of the first part illustrate the techniques contained therein. In the second part, a more general point of view is taken. The stock price processes, which again allow for the possibility of arbitrage, are no longer assumed to be Markovian, but rather only Itô processes. We then prove the Second Fundamental Theorem of Asset Pricing for these markets: A market is complete, meaning that any bounded contingent claim is replicable, if and only if the stochastic discount factor is unique. Conditions under which a contingent claim can be perfectly replicated in an incomplete market are established.
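Schematically (using generic notation, not the thesis' own), if the stock follows $\mathrm{d}S_t = S_t\,\sigma(t, S_t)\,\mathrm{d}W_t$ under the constructed measure, the hedging price $h$ of a terminal payoff $g$ satisfies a Black-Scholes-type PDE

```latex
\frac{\partial h}{\partial t}(t,s)
  + \frac{1}{2}\,\sigma^2(t,s)\,s^2\,\frac{\partial^2 h}{\partial s^2}(t,s) = 0,
\qquad h(T,s) = g(s),
```

and the non-uniqueness mentioned in the abstract means this PDE may admit several solutions when arbitrage is present; the minimal required initial capital then corresponds to the smallest nonnegative solution.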
Then, precise conditions under which relative arbitrage and strong relative arbitrage with respect to a given trading strategy exist are explicated. In addition, it is shown that if the market is quasi-complete, meaning that any bounded contingent claim measurable with respect to the stock price filtration is replicable, relative arbitrage implies strong relative arbitrage. It is further demonstrated that markets are quasi-complete, subject to the condition that the drift and diffusion coefficients are measurable with respect to the stock price filtration.Mathematics, Financejkr2115Statistics, MathematicsDissertationsContagion and Systemic Risk in Financial Networks
http://academiccommons.columbia.edu/catalog/ac:131474
Moussa, Amalhttp://hdl.handle.net/10022/AC:P:10249Fri, 29 Apr 2011 18:12:27 +0000The 2007-2009 financial crisis has shed light on the importance of contagion and systemic risk, and revealed the lack of adequate indicators for measuring and monitoring them. This dissertation addresses these issues and leads to several recommendations for the design of an improved assessment of systemic importance, improved rating methods for structured finance securities, and their use by investors and risk managers. Using a complete data set of all mutual exposures and capital levels of financial institutions in Brazil in 2007 and 2008, we explore in chapter 2 the structure and dynamics of the Brazilian financial system. We show that the Brazilian financial system exhibits a complex network structure characterized by a strong degree of heterogeneity in connectivity and exposure sizes across institutions, which is qualitatively and quantitatively similar to the statistical features observed in other financial systems. We find that the Brazilian financial network is well represented by a directed scale-free network, rather than a small world network. Based on these observations, we propose a stochastic model for the structure of banking networks, representing them as a directed weighted scale-free network with power law distributions for the in-degree and out-degree of nodes and a Pareto distribution for exposures. This model may then be used for simulation studies of contagion and systemic risk in networks. We propose in chapter 3 a quantitative methodology for assessing contagion and systemic risk in a network of interlinked institutions. We introduce the Contagion Index as a metric of the systemic importance of a single institution or a set of institutions, which combines the effects of both common market shocks to portfolios and contagion through counterparty exposures. 
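The network model described above (power-law degrees, Pareto exposure sizes) can be sketched in a few lines. The parameter values, the zipf sampler for degrees, the uniform choice of counterparties, and the function name are illustrative assumptions, not the dissertation's calibration:

```python
import numpy as np

def simulate_exposure_network(n_banks=50, gamma=2.5, pareto_shape=1.5, seed=0):
    """Toy directed scale-free exposure network.

    Out-degrees are drawn from a discrete power law with exponent `gamma`;
    counterparties are chosen uniformly; exposure sizes are Pareto-distributed.
    Returns an n x n matrix E where E[i, j] is bank i's exposure to bank j.
    """
    rng = np.random.default_rng(seed)
    # discrete power-law out-degrees, truncated to [1, n_banks - 1]
    deg = np.clip(rng.zipf(gamma, size=n_banks), 1, n_banks - 1)
    E = np.zeros((n_banks, n_banks))
    for i in range(n_banks):
        partners = rng.choice([j for j in range(n_banks) if j != i],
                              size=deg[i], replace=False)
        E[i, partners] = rng.pareto(pareto_shape, size=deg[i]) + 1.0
    return E
```

Repeatedly sampling such matrices (together with capital levels) is the kind of Monte Carlo input the abstract's simulation studies of contagion rely on.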
Using a directed scale-free graph simulation of the financial system, we study the sensitivity of contagion to a change in aggregate network parameters: connectivity, concentration of exposures, heterogeneity in degree distribution and network size. More concentrated and more heterogeneous networks are found to be more resilient to contagion. The impact of connectivity is more ambiguous: in well-capitalized networks, increasing connectivity improves the resilience to contagion when the initial level of connectivity is high, but increases contagion when the initial level of connectivity is low. In undercapitalized networks, increasing connectivity tends to increase the severity of contagion. We also study the sensitivity of contagion to local measures of connectivity and concentration across counterparties --the counterparty susceptibility and local network frailty-- which are found to have a monotonically increasing relationship with the systemic risk of an institution. Requiring a minimum (aggregate) capital ratio is shown to reduce the systemic impact of defaults of large institutions; we show that the same effect may be achieved with less capital by imposing such capital requirements only on systemically important institutions and those exposed to them. In chapter 4, we apply this methodology to the study of the Brazilian financial system. Using the Contagion Index, we study the potential for default contagion and systemic risk in the Brazilian system and analyze the contribution of balance sheet size and network structure to systemic risk. Our study reveals that, aside from balance sheet size, the network-based local measures of connectivity and concentration of exposures across counterparties introduced in chapter 3, the counterparty susceptibility and local network frailty, contribute significantly to the systemic importance of an institution in the Brazilian network. Thus, imposing an upper bound on these variables could help reduce contagion. 
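The contagion mechanism underlying such sensitivity studies can be illustrated with a standard Furfine-style default cascade. This is a simplified stand-in for, not a reproduction of, the Contagion Index computation (which also includes common market shocks); the function name and zero-recovery default are illustrative assumptions:

```python
import numpy as np

def default_cascade(E, capital, initial_defaults, recovery=0.0):
    """Furfine-style default cascade on an exposure matrix.

    E[i, j] : exposure of institution i to institution j
    Losses from defaulted counterparties are written down against capital;
    any institution whose cumulative loss reaches its capital defaults too.
    Returns the boolean default indicator once the cascade settles.
    """
    n = len(capital)
    defaulted = np.zeros(n, dtype=bool)
    defaulted[list(initial_defaults)] = True
    while True:
        loss = (1.0 - recovery) * E[:, defaulted].sum(axis=1)
        new = (loss >= capital) & ~defaulted
        if not new.any():
            return defaulted
        defaulted |= new
```

Running this cascade over many simulated networks and shock scenarios is how capital requirements (aggregate vs. targeted) can be compared in terms of the resulting extent of contagion.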
We examine the impact of various capital requirements on the extent of contagion in the Brazilian financial system, and show that targeted capital requirements achieve the same reduction in systemic risk with lower capital requirements for financial institutions. The methodology we proposed in chapter 3 for estimating contagion and systemic risk requires visibility on the entire network structure. Reconstructing bilateral exposures from balance sheet data is then a question of interest in a financial system where bilateral exposures are not disclosed. We propose in chapter 5 two methods to derive a distribution of bilateral exposures matrices. The first method attempts to recover the balance sheet assets and liabilities "sample by sample". Each sample of the bilateral exposures matrix is the solution of a relative entropy minimization problem subject to the balance sheet constraints. However, a solution to this problem does not always exist when dealing with sparse sample matrices. Thus, we propose a second method that attempts to recover the assets and liabilities "in the mean". This approach is the analogue of the Weighted Monte Carlo method introduced by Avellaneda et al. (2001). We first simulate independent samples of the bilateral exposures matrix from a relevant prior distribution on the network structure, then we compute posterior probabilities by maximizing the entropy under the constraints that the balance sheet assets and liabilities are recovered in the mean. We discuss the pros and cons of each approach and explain how it could be used to detect systemically important institutions in the financial system. The recent crisis has also raised many questions regarding the meaning of structured finance credit ratings issued by rating agencies and the methodology behind them. 
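For a strictly positive prior, relative entropy minimization under row- and column-sum constraints (the "sample by sample" problem above, in its dense case) can be solved by iterative proportional fitting, a standard fact about entropy projections onto marginal constraints. A minimal sketch, with the function name and iteration count as assumptions:

```python
import numpy as np

def fit_exposures(prior, assets, liabilities, n_iter=500):
    """Minimize relative entropy to `prior` subject to row sums = interbank
    assets and column sums = interbank liabilities, via iterative
    proportional fitting. Assumes sum(assets) == sum(liabilities) and a
    strictly positive off-diagonal prior."""
    X = prior.astype(float).copy()
    np.fill_diagonal(X, 0.0)                         # no self-exposure
    for _ in range(n_iter):
        X *= (assets / X.sum(axis=1))[:, None]       # match row sums
        X *= (liabilities / X.sum(axis=0))[None, :]  # match column sums
    return X
```

As the abstract notes, this construction can fail for sparse priors (some constraint sets become infeasible), which motivates the second, Weighted Monte Carlo-style method that only recovers the balance sheet totals in the mean across simulated network samples.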
Chapter 6 aims at clarifying some misconceptions related to structured finance ratings and how they are commonly interpreted: we discuss the comparability of structured finance ratings with bond ratings, the interaction between the rating procedure and the tranching procedure and its consequences for the stability of structured finance ratings in time. These insights are illustrated in a factor model by simulating rating transitions for CDO tranches using a nested Monte Carlo method. In particular, we show that the downgrade risk of a CDO tranche can be quite different from that of a bond with the same initial rating. Structured finance ratings follow path-dependent dynamics that cannot be adequately described, as usually done, by a matrix of transition probabilities. Therefore, a simple labeling via default probability or expected loss does not sufficiently discriminate their downgrade risk. We propose to supplement ratings with indicators of downgrade risk. To overcome some of the drawbacks of existing rating methods, we suggest a risk-based rating procedure for structured products. Finally, we formulate a series of recommendations regarding the use of credit ratings for CDOs and other structured credit instruments.Finance, Statisticsam2810Statistics, Industrial Engineering and Operations ResearchDissertations
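The factor-model setting referred to above can be sketched with a plain one-factor Gaussian copula estimate of expected tranche loss (a single-level simulation, not the nested Monte Carlo used for rating transitions; the parameter values and function name are illustrative assumptions):

```python
import numpy as np

def tranche_loss(attach, detach, rho=0.3, p_default=0.02,
                 n_names=100, n_paths=20000, seed=0):
    """Expected loss of a [attach, detach] tranche in a one-factor
    Gaussian copula with unit loss-given-default."""
    rng = np.random.default_rng(seed)
    c = -2.054                                        # Phi^{-1}(0.02), hardcoded
    Z = rng.standard_normal(n_paths)                  # common (systemic) factor
    eps = rng.standard_normal((n_paths, n_names))     # idiosyncratic factors
    X = np.sqrt(rho) * Z[:, None] + np.sqrt(1 - rho) * eps
    L = (X < c).mean(axis=1)                          # portfolio loss fraction
    width = detach - attach
    return float(np.clip(L - attach, 0.0, width).mean() / width)
```

Comparing tranches with different attachment points under such a model makes the chapter's point concrete: two instruments with similar expected loss (and hence similar rating labels) can carry very different sensitivity to the common factor, and so very different downgrade risk.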