Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Statistics&f%5Bgenre_facet%5D%5B%5D=Dissertations&q=&rows=500&sort=record_creation_date+desc
An Assortment of Unsupervised and Supervised Applications to Large Data
http://academiccommons.columbia.edu/catalog/ac:189937
Agne, Michael Robert
http://dx.doi.org/10.7916/D828073N
Thu, 15 Oct 2015 00:00:00 +0000
This dissertation presents several methods that can be applied to large datasets with an enormous number of covariates. It is divided into two parts. In the first part of the dissertation, a novel approach to pinpointing sets of related variables is introduced. In the second part, several new methods and modifications of current methods designed to improve prediction are outlined. These methods can be considered extensions of the very successful I Score suggested by Lo and Zheng in a 2002 paper and refined in many papers since. In Part I, unsupervised data (with no response) is addressed. In chapter 2, the novel unsupervised I Score and its associated procedure are introduced and some of its unique theoretical properties are explored. In chapter 3, several simulations consisting of generally hard-to-wrangle scenarios demonstrate promising behavior of the approach. In chapter 4, the method is applied to the complex field of market basket analysis, with a specific grocery data set used to show it in action. It is compared to a natural competitor, the Apriori algorithm. The main contribution of this part of the dissertation is the unsupervised I Score, but we also suggest several ways to leverage the variable sets the I Score locates in order to mine for association rules. In Part II, supervised data is confronted. Though the I Score has been used in reference to these types of data in the past, several interesting ways of leveraging it (and the modules of covariates it identifies) are investigated. Though much of this methodology adopts procedures which are individually well-established in the literature, the contribution of this dissertation is the organization and implementation of these methods in the context of the I Score. Several module-based regression and voting methods are introduced in chapter 7, including a new LASSO-based method for optimizing voting weights. These methods can be considered intuitive and readily applicable to a huge number of datasets of sometimes colossal size. In particular, in chapter 8, a large dataset on Hepatitis and another on Oral Cancer are analyzed. The results for some of the methods are quite promising and competitive with existing methods, especially with regard to prediction. A flexible and multifaceted procedure is suggested in order to provide a thorough arsenal when dealing with the problem of prediction in these complex data sets. Ultimately, we highlight some benefits and future directions of the method.
Statistics, Biostatistics
mra2110
Statistics
Dissertations

Efficiency in Lung Transplant Allocation Strategies
http://academiccommons.columbia.edu/catalog/ac:187899
Zou, Jingjing
http://dx.doi.org/10.7916/D8QV3KKZ
Tue, 12 May 2015 18:28:18 +0000
Currently in the United States, lungs are allocated to transplant candidates based on the Lung Allocation Score (LAS). The LAS is an empirically derived score aimed at increasing total life span pre- and post-transplantation for patients on lung transplant waiting lists. The goal here is to develop efficient allocation strategies in the context of lung transplantation.
In this study, patient and organ arrivals to the waiting list are modeled as independent homogeneous Poisson processes. Patients' health statuses prior to allocation are modeled as evolving according to independent and identically distributed finite-state inhomogeneous Markov processes, in which death is treated as an absorbing state. The expected post-transplantation residual life is modeled as depending on time on the waiting list and on current health status. For allocation strategies satisfying certain minimal fairness requirements, the long-term limit of expected average total life exists, and is used as the standard for comparing allocation strategies.
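As a rough illustration of the arrival model only (not code from the dissertation), two independent homogeneous Poisson processes can be simulated by summing exponential inter-arrival times. The function name, the rates, and the time horizon below are hypothetical values chosen for the sketch; the Markov health-status dynamics are not modeled here.

```python
import random


def poisson_arrivals(rate, horizon, rng):
    """Return the arrival times of a homogeneous Poisson process on (0, horizon].

    Inter-arrival times of a rate-`rate` Poisson process are i.i.d.
    exponential with mean 1/rate, so we accumulate exponential gaps.
    """
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
patient_rate = 2.0   # assumed patient arrivals per unit time
organ_rate = 1.0     # assumed organ arrivals per unit time
horizon = 1000.0     # assumed length of the simulated period

# Drawing the two streams one after the other from the same generator
# still yields independent processes, because the draws are independent.
patients = poisson_arrivals(patient_rate, horizon, rng)
organs = poisson_arrivals(organ_rate, horizon, rng)

# Empirical ratio of organ arrivals to patient arrivals.
ratio = len(organs) / len(patients)
print(len(patients), len(organs), round(ratio, 3))
```

The empirical ratio of organ arrivals to patient arrivals computed at the end should be close to the ratio of the two assumed rates over a long horizon.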
Via the Hamilton-Jacobi-Bellman equations, upper bounds for the long-term expected average total life are derived as a function of the ratio of the organ arrival rate to the patient arrival rate. Corresponding to each upper bound is an allocable set of (state, time) pairs at which patients would be optimally transplanted. As availability of organs increases, the allocable set expands monotonically, and ranking members of the waiting list according to the availability at which they enter the allocable set provides an allocation strategy that leads to long-term expected average total life close to the upper bound.
Simulation studies are conducted with model parameters estimated from national lung transplantation data from the United Network for Organ Sharing (UNOS). Results suggest that, compared to the LAS, the proposed allocation strategy could provide a 7% increase in average total life.
Statistics
jz2335
Statistics
Dissertations

Mathematical Modeling of Insider Trading
http://academiccommons.columbia.edu/catalog/ac:178871
Bilina Falafala, Roseline
http://dx.doi.org/10.7916/D89W0D33
Mon, 13 Oct 2014 00:00:00 +0000
In this thesis, we study insider trading and consider a financial market and an enlarged financial market whose sets of information are respectively represented by the filtrations F and G. The filtration G is obtained by initially expanding the filtration F. We also assume a finite trading horizon. First, we show that under certain conditions the enlarged market satisfies no free lunch with vanishing risk (NFLVR) locally and therefore satisfies no arbitrage with respect to admissible simple predictable trading strategies. In addition, we generalize the structure of all the G local martingale deflators and find sufficient conditions under which the enlarged market satisfies NFLVR. We apply our results to some recent examples of insider trading that have appeared in newspapers and, by doing so, show the limitations of some previous works that have studied the stability of the NFLVR property under an initial expansion.
Second, assuming the enlarged market satisfies NFLVR and markets are incomplete, we define a notion of risk and compare the risk of a market or liquidity trader to the risk of an insider trader. We prove that the risk of an insider is smaller than the risk of a market/liquidity trader under some sufficient conditions that involve their respective trading strategies. We find a relationship between the trading strategies of a market trader and of an insider when the risk neutral measure of the market is used. If an insider trades using the market risk neutral measure and not her own, then her trading strategy should involve not only the stock but also the volatility of the stock.
Finally, assuming that the enlarged market satisfies NFLVR locally, we provide a way for an insider to price her financial claims. We also define a new type of process that we call a quasi-local martingale and prove that the stock price process under local NFLVR is one such process.
Applied mathematics, Finance
Statistics
Dissertations

Applying Large-Scale Data and Modern Statistical Methods to Classical Problems in American Politics
http://academiccommons.columbia.edu/catalog/ac:177212
Ghitza, Yair
http://dx.doi.org/10.7916/D8ZS2TT3
Mon, 08 Sep 2014 00:00:00 +0000
Exponential growth in data storage and computing capacity, alongside the development of new statistical methods, has facilitated powerful and flexible new research capabilities across a variety of disciplines. In each of these three essays, I use some new large-scale data source or advanced statistical method to address a well-known problem in the American Political Science literature. In the first essay, I build a generational model of presidential voting, in which long-term partisan presidential voting preferences are formed, in large part, through a weighted "running tally" of retrospective presidential evaluations, where weights are determined by the age at which the evaluation was made. By gathering hundreds of thousands of survey responses in combination with a new Bayesian model, I show that the political events of a voter's teenage and early adult years, centered around the age of 18, are enormously influential, particularly among white voters. In the second and third essays, I leverage a national voter registration database, which contains records for over 190 million registered voters, alongside methods like multilevel regression and poststratification (MRP) and coarsened exact matching (CEM) to address critical issues in public opinion research and in our understanding of the consequences of higher or lower turnout. In the process, I make numerous methodological and substantive contributions, including: building on the capabilities of MRP generally, describing methods for dealing with data of this size in the context of social science research, and characterizing mathematical limits of how turnout can impact election outcomes.
Political science
yg2173
Political Science, Statistics
Dissertations

Limit Theory for Spatial Processes, Bootstrap Quantile Variance Estimators, and Efficiency Measures for Markov Chain Monte Carlo
http://academiccommons.columbia.edu/catalog/ac:188852
Yang, Xuan
http://dx.doi.org/10.7916/D84X560Z
Thu, 07 Aug 2014 00:00:00 +0000
This thesis contains three topics: (I) limit theory for spatial processes, (II) asymptotic results on the bootstrap quantile variance estimator for importance sampling, and (III) an efficiency measure of MCMC.
(I) First, central limit theorems are obtained for sums of observations from a $\kappa$-weakly dependent random field. In particular, the observations are taken to be made from a random field at irregularly spaced and possibly random locations. The sums of these samples as well as sums of functions of pairs of the observations are objects of interest; the latter has applications in covariance estimation, composite likelihood estimation, etc. Moreover, examples of $\kappa$-weakly dependent random fields are explored and a method for the evaluation of $\kappa$-coefficients is presented. Next, statistical inference is considered for the stochastic heteroscedastic processes (SHP), which generalize the stochastic volatility time series model to space. A composite likelihood approach is adopted for parameter estimation, where the composite likelihood function is formed by a weighted sum of pairwise log-likelihood functions. In addition, the observation sites are assumed to be distributed according to a spatial point process. Sufficient conditions are provided for the maximum composite likelihood estimator to be consistent and asymptotically normal.
(II) It is often difficult to provide an accurate estimate of the variance of the weighted sample quantile. Its asymptotic approximation requires the value of the density function, which may be hard to evaluate in complex systems. To circumvent this problem, the bootstrap estimator is considered. Theoretical results are established for the exact convergence rate and asymptotic distributions of the bootstrap variance estimators for quantiles of weighted empirical distributions. Under regularity conditions, it is shown that the bootstrap variance estimator is asymptotically normal and has relative standard deviation of order $O(n^{-1/4})$.
(III) A new performance measure is proposed to evaluate the efficiency of Markov chain Monte Carlo (MCMC) algorithms. More precisely, the large deviations rate of the probability that the Monte Carlo estimator deviates from the true value by a certain distance is used as a measure of efficiency of a particular MCMC algorithm. Numerical methods are proposed for the computation of the rate function based on samples of the renewal cycles of the Markov chain. Furthermore, the efficiency measure is applied to an array of MCMC schemes to determine their optimal tuning parameters.
Statistics
xy2139
Statistics
Dissertations

Statistical Inference and Experimental Design for Q-matrix Based Cognitive Diagnosis Models
http://academiccommons.columbia.edu/catalog/ac:176169
Zhang, Stephanie
http://dx.doi.org/10.7916/D8TQ5ZP5
Mon, 07 Jul 2014 00:00:00 +0000
There has been growing interest in recent years in using cognitive diagnosis models for diagnostic measurement, i.e., classification according to multiple discrete latent traits. The Q-matrix, an incidence matrix specifying the presence or absence of a relationship between each item in the assessment and each latent attribute, is central to many of these models. Important applications include educational and psychological testing; demand in education, for example, has been driven by recent focus on skills-based evaluation. However, compared to more traditional models coming from classical test theory and item response theory, cognitive diagnosis models are relatively undeveloped and suffer from several issues limiting their applicability. This thesis examines several issues related to statistical inference and experimental design for Q-matrix based cognitive diagnosis models. We begin by considering one of the main statistical issues affecting the practical use of Q-matrix based cognitive diagnosis models, the identifiability issue. In statistical models, identifiability is a prerequisite for most common statistical inferences, including parameter estimation and hypothesis testing. With Q-matrix based cognitive diagnosis models, identifiability also affects the classification of respondents according to their latent traits. We first examine the identifiability of model parameters, presenting necessary and sufficient conditions for identifiability in several settings. Depending on the area of application and the researcher's degree of control over the experiment design, fulfilling these identifiability conditions may be difficult. The second part of this thesis proposes new methods for parameter estimation and respondent classification for use with non-identifiable models. In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it. The implications of this measure for the design of diagnostic assessments are also discussed.
Statistics, Educational tests and measurements, Quantitative psychology and psychometrics
Statistics
Dissertations

Unbiased Penetrance Estimates with Unknown Ascertainment Strategies
http://academiccommons.columbia.edu/catalog/ac:175879
Gore, Kristen
http://dx.doi.org/10.7916/D8KP8098
Mon, 07 Jul 2014 00:00:00 +0000
Allelic variation in the genome leads to variation in individuals' production of proteins. This, in turn, leads to variation in traits and development, and, in some cases, to diseases. Understanding the genetic basis for disease can aid in the search for therapies and in guiding genetic counseling. Thus, it is of interest to discover the genes with mutations responsible for diseases and to understand the impact of allelic variation at those genes. A subject's genetic composition is commonly referred to as the subject's genotype. Subjects who carry the gene mutation of interest are referred to as carriers. Subjects who are afflicted with a disease under study (that is, subjects who exhibit the phenotype) are termed affected carriers. The age-specific probability that a given subject will exhibit a phenotype of interest, given mutation status at a gene, is known as penetrance. Understanding penetrance is an important facet of genetic epidemiology. Penetrance estimates are typically calculated via maximum likelihood from family data. However, penetrance estimates can be biased if the nature of the sampling strategy is not correctly reflected in the likelihood. Unfortunately, sampling of family data may be conducted in a haphazard fashion or, even if conducted systematically, might be reported in an incomplete fashion. Bias is possible in applying likelihood methods to reported data if (as is commonly the case) some unaffected family members are not represented in the reports. The purpose here is to present an approach to find efficient and unbiased penetrance estimates in cases where there is incomplete knowledge of the sampling strategy and incomplete information on the full pedigree structure of families included in the data. The method may be applied with different conjectural assumptions about the ascertainment strategy to balance the possibly biasing effects of wishful assumptions about the sampling strategy with the efficiency gains that could be obtained through valid assumptions.
Statistics
Statistics
Dissertations

A Point Process Model for the Dynamics of Limit Order Books
http://academiccommons.columbia.edu/catalog/ac:171221
Vinkovskaya, Ekaterina
http://dx.doi.org/10.7916/D88913WW
Fri, 28 Feb 2014 00:00:00 +0000
This thesis focuses on the statistical modeling of the dynamics of limit order books in electronic equity markets. The statistical properties of events affecting a limit order book (market orders, limit orders and cancellations) reveal strong evidence of clustering in time, cross-correlation across event types and dependence of the order flow on the bid-ask spread. Further investigation reveals the presence of a self-exciting property: a large number of events in a given time period tends to imply a higher probability of observing a large number of events in the following time period. We show that these properties may be adequately represented by a multivariate self-exciting point process with multiple regimes that reflect changes in the bid-ask spread. We propose a tractable parametrization of the model and perform maximum likelihood estimation of the model using high-frequency data from the Trades and Quotes database for US stocks. We show that the model may be used to obtain predictions of order flow and that its predictive performance beats the Poisson model as well as moving average and autoregressive time series models.
Statistics
Statistics
Dissertations

Mixed Methods for Mixed Models
http://academiccommons.columbia.edu/catalog/ac:169644
Dorie, Vincent J.
http://dx.doi.org/10.7916/D8V40S5X
Wed, 22 Jan 2014 00:00:00 +0000
This work bridges the frequentist and Bayesian approaches to mixed models by borrowing the best features from both camps: point estimation procedures are combined with priors to obtain accurate, fast inference, while posterior simulation techniques are developed that approximate the likelihood with great precision for the purposes of assessing uncertainty. These allow flexible inferences without the need to rely on expensive Markov chain Monte Carlo simulation techniques. Default priors are developed and evaluated in a variety of simulation and real-world settings, with the end result that we propose a new set of standard approaches that yield superior performance at little computational cost.
Statistics
Statistics
Dissertations

Kernel-based association measures
http://academiccommons.columbia.edu/catalog/ac:167034
Liu, Ying
http://hdl.handle.net/10022/AC:P:22154
Thu, 07 Nov 2013 00:00:00 +0000
Measures of association have been widely used for describing the statistical relationships between two sets of variables. Traditional association measures tend to focus on specialized settings (specific types of variables or association patterns). Based on an in-depth summary of existing measures, we propose a general framework for association measures unifying existing methods and novel extensions based on kernels, including practical solutions to computational challenges. The proposed framework provides improved feature selection and extensions to a variety of current classifiers. Specifically, we introduce association screening and variable selection via maximizing kernel-based association measures. We also develop a backward dropping procedure for feature selection when there are a large number of candidate variables. We evaluate our framework using a wide variety of both simulated and real data. In particular, we conduct independence tests and feature selection using kernel association measures on diversified association patterns of different dimensions and variable types. The results show the superiority of our methods to existing ones. We also apply our framework to four real-world problems, three from statistical genetics and one of gender prediction from handwriting. We demonstrate through these applications both the de novo construction of new kernels and the adaptation of existing kernels tailored to the data at hand, and how kernel-based measures of association can be naturally applied to different data structures including functional input and output spaces. This shows that our framework can be applied to a wide range of real-world problems and works well in practice.
Statistics, Computer science
yl2802
Statistics
Dissertations

Inference of functional neural connectivity and convergence acceleration methods
http://academiccommons.columbia.edu/catalog/ac:179409
Nikitchenko, Maxim V.
http://hdl.handle.net/10022/AC:P:22052
Thu, 31 Oct 2013 00:00:00 +0000
Knowledge of the maps of neuronal interactions is key for systems neuroscience, but at the moment we possess relatively little of it. The recent development of experimental methods which allow simultaneous recording of the spiking activity, but not the intracellular voltage, of thousands of neurons gives us an opportunity to start filling that gap. In Chapter 2, I present a method for the inference of the parameters of the leaky integrate-and-fire (LIF) model featuring time-dependent currents and conductances based only on the extracellular recording of spiking in the network. The fitted parameters can describe the functional connections in the network, as well as the internal properties of the cells. The method can also be used to determine whether a single-compartment model of a neuron should include conductance- or current-based synapses, or their mixture. In addition, because the same mathematical model describes some of the flavors of the Drift Diffusion Model (DDM), popular in studies of the decision-making process, the presented method can be readily used to fit their parameters. Making the proposed inference procedure, which is based on the expectation-maximization (EM) algorithm, accurate and robust necessitated the development of a new numerical adaptive-grid (AG) method for the forward-backward (FB) propagation of the probability density, which is required in the computation of the sufficient statistic in the EM algorithm. These topics are covered in Chapter 3. Another issue which had to be addressed in order to obtain a usable inference algorithm is the well-known slow convergence of the EM algorithm in the flat regions of the loglikelihood. Two complementary approaches to this issue are presented in this dissertation. In Chapter 4, I present a new framework for the acceleration of convergence of iterative algorithms (not limited to the EM) which unifies all previously known methods and allows us to construct a new method demonstrating the best performance of them all. To make the computations even faster, I wrote a Matlab package which allows them to be done in parallel on several machines and clusters. As one can see, all the aforementioned projects sprouted from one "head" project on the inference of the LIF model parameters. At the end of the dissertation, I briefly describe a disconnected project which is devoted to the development of a flexible experimental setup (software and hardware) for behavioral experiments, with a specific application to a particular type of the virtual Morris water maze experiment (VMWM).
Neurosciences, Statistics
mvn2104
Statistics, Neurobiology and Behavior
Dissertations

Low-rank graphical models and Bayesian inference in the statistical analysis of noisy neural data
http://academiccommons.columbia.edu/catalog/ac:166472
Smith, Carl Alexander
http://hdl.handle.net/10022/AC:P:21991
Fri, 11 Oct 2013 00:00:00 +0000
We develop new methods of Bayesian inference, largely in the context of analysis of neuroscience data. The work is broken into several parts. In the first part, we introduce a novel class of joint probability distributions in which exact inference is tractable. Previously it has been difficult to find general constructions for models in which efficient exact inference is possible, outside of certain classical cases. We identify a class of such models that are tractable owing to a certain "low-rank" structure in the potentials that couple neighboring variables. In the second part we develop methods to quantify and measure information loss in analysis of neuronal spike train data due to two types of noise, making use of the ideas developed in the first part. Information about neuronal identity or temporal resolution may be lost during spike detection and sorting, or precision of spike times may be corrupted by various effects. We quantify the information lost due to these effects for the relatively simple but sufficiently broad class of Markovian model neurons. We find that decoders that model the probability distribution of spike-neuron assignments significantly outperform decoders that use only the most likely spike assignments. We also apply the ideas of the low-rank models from the first section to defining a class of prior distributions over the space of stimuli (or other covariates) which, by conjugacy, preserve the tractability of inference. In the third part, we treat Bayesian methods for the estimation of sparse signals, with application to the locating of synapses in a dendritic tree. We develop a compartmentalized model of the dendritic tree. Building on previous work that applied and generalized ideas of least angle regression to obtain a fast Bayesian solution to the resulting estimation problem, we describe two other approaches to the same problem, one employing a horseshoe prior and the other using various spike-and-slab priors. In the last part, we revisit the low-rank models of the first section and apply them to the problem of inferring orientation selectivity maps from noisy observations of orientation preference. The relevant low-rank model exploits the self-conjugacy of the von Mises distribution on the circle. Because the orientation map model is loopy, we cannot do exact inference on the low-rank model by the forward-backward algorithm, but block-wise Gibbs sampling by the forward-backward algorithm speeds mixing. We explore another von Mises coupling potential Gibbs sampler that proves to effectively smooth noisily observed orientation maps.
Statistics, Neurosciences
cas2207
Statistics, Chemistry
Dissertations

Generalized Volatility-Stabilized Processes
http://academiccommons.columbia.edu/catalog/ac:165162
Pickova, Radka
http://hdl.handle.net/10022/AC:P:21616
Fri, 13 Sep 2013 00:00:00 +0000
In this thesis, we consider systems of interacting diffusion processes which we call Generalized Volatility-Stabilized processes, as they extend the Volatility-Stabilized Market models introduced in Fernholz and Karatzas (2005). First, we show how to construct a weak solution of the underlying system of stochastic differential equations. In particular, we express the solution in terms of time-changed squared-Bessel processes and argue that this solution is unique in distribution. In addition, we discuss sufficient conditions under which this solution does not explode in finite time, and provide sufficient conditions for pathwise uniqueness and for existence of a strong solution. Secondly, we discuss the significance of these processes in the context of Stochastic Portfolio Theory. We describe specific market models which assume that the dynamics of the stocks' capitalizations is the same as that of the Generalized Volatility-Stabilized processes, and we argue that strong relative arbitrage opportunities may exist in these markets; specifically, we provide multiple examples of portfolios that outperform the market portfolio. Moreover, we examine the properties of market weights as well as the diversity weighted portfolio in these models. Thirdly, we provide some asymptotic results for these processes which allow us to describe different properties of the corresponding market models based on these processes.
Statistics
rp2424
Statistics, Mathematics
Dissertations

Credit Risk Modeling and Analysis Using Copula Method and Changepoint Approach to Survival Data
http://academiccommons.columbia.edu/catalog/ac:161682
Qian, Bo
http://hdl.handle.net/10022/AC:P:20510
Thu, 30 May 2013 00:00:00 +0000
This thesis consists of two parts. The first part uses the Gaussian copula and Student's t copula as the main tools to model the credit risk in securitizations and re-securitizations. The second part proposes a statistical procedure to identify changepoints in the Cox model of survival data. The recent 2007-2009 financial crisis has been regarded as the worst financial crisis since the Great Depression by leading economists. The securitization sector took a lot of blame for the crisis because of the connection of the securitized products created from mortgages to the collapse of the housing market. The first part of this thesis explores the relationship between securitized mortgage products and the 2007-2009 financial crisis using the copula method as the main tool. We show in this part how loss distributions of securitizations and re-securitizations can be derived or calculated in a new model. Simulations are conducted to examine the effectiveness of the model. As an application, the model is also used to examine whether and where the ratings of securitized products could be flawed. On the other hand, the lag effect and saturation effect problems are common and important problems in survival data analysis. They belong to a general class of problems where the treatment effect takes occasional jumps instead of staying constant throughout time. Therefore, they are essentially changepoint problems in statistics. The second part of this thesis focuses on extending Lai and Xing's recent work in changepoint modeling, which was developed under a time series and Bayesian setup, to the lag effect problems in survival data. A general changepoint approach for the Cox model is developed. Simulations and real data analyses are conducted to illustrate the effectiveness of the procedure and how it should be implemented and interpreted.
Statistics
bq2102
Statistics
Dissertations

Estimation and Testing Methods for Monotone Transformation Models
http://academiccommons.columbia.edu/catalog/ac:188499
Zhang, Junyi
http://dx.doi.org/10.7916/D8348JQD
Thu, 23 May 2013 00:00:00 +0000
This thesis deals with a general class of transformation models that contains many important semiparametric regression models as special cases. It develops a self-induced smoothing method for estimating the regression coefficients of these models, resulting in simultaneous point and variance estimation. The self-induced smoothing does not require bandwidth selection, yet provides the right amount of smoothness so that the estimator is asymptotically normal with mean zero (unbiased) and variance-covariance matrix consistently estimated by the usual sandwich-type estimator. An iterative algorithm is given for the variance estimation and shown to numerically converge to a consistent limiting variance estimator. The self-induced smoothing method is also applied to selecting the non-zero regression coefficients for the monotone transformation models. The resulting regularized estimator is shown to be root-n-consistent and to achieve desirable sparsity and asymptotic normality under certain regularity conditions. The smoothing technique is used to estimate the monotone transformation function as well. The smoothed rank-based estimate of the transformation function is uniformly consistent and converges weakly to a Gaussian process which is the same as the limiting process for the estimate without smoothing. An explicit covariance function estimate is obtained by using the smoothing technique, and shown to be consistent. The estimation of the transformation function reduces the multiple hypothesis testing problems for the monotone transformation models to those for linear models. A new hypothesis testing procedure is proposed in this thesis for linear models and shown to be more powerful than some widely used testing methods when there is strong collinearity in the data. It is proved that the new testing procedure controls the family-wise error rate.
Statistics
jz2299
Statistics
Dissertations

On optimal arbitrage under constraints
http://academiccommons.columbia.edu/catalog/ac:160495
Sadhukhan, Subhankarhttp://hdl.handle.net/10022/AC:P:20076Wed, 01 May 2013 00:00:00 +0000In this thesis, we investigate the existence of relative arbitrage opportunities in a Markovian model of a financial market, which consists of a bond and stocks, whose prices evolve like Itô processes. We consider markets where investors are constrained to choose from among a restricted set of investment strategies. We show that the upper hedging price of (i.e. the minimum amount of wealth needed to superreplicate) a given contingent claim in a constrained market can be expressed as the supremum of the fair price of the given contingent claim under certain unconstrained auxiliary Markovian markets. Under suitable assumptions, we further characterize the upper hedging price as a viscosity solution to certain variational inequalities. We then use this viscosity solution characterization to study how the imposition of stricter constraints on the market affects the upper hedging price. In particular, if relative arbitrage opportunities exist with respect to a given strategy, we study how stricter constraints can make such arbitrage opportunities disappear.Applied mathematics, Financess3240Statistics, MathematicsDissertationsStatistical Inference for Diagnostic Classification Models
http://academiccommons.columbia.edu/catalog/ac:160464
Xu, Gongjunhttp://hdl.handle.net/10022/AC:P:20058Tue, 30 Apr 2013 00:00:00 +0000Diagnostic classification models (DCM) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he/she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers to the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be further extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, this problem can be formulated as a pure significance testing problem with null hypothesis H0 : Q = Q0, where Q0 is the candidate Q-matrix. 
We propose a test statistic that measures the consistency of observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied. In addition, we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.Statistics, Educational tests and measurementsgx2108StatisticsDissertationsBayesian Model Selection in terms of Kullback-Leibler discrepancy
http://academiccommons.columbia.edu/catalog/ac:158374
Zhou, Shouhaohttp://hdl.handle.net/10022/AC:P:19157Mon, 25 Feb 2013 00:00:00 +0000In this article we investigate and develop practical model assessment and selection methods for Bayesian models, on the premise that a promising approach should be objective enough to accept, easy enough to understand, general enough to apply, simple enough to compute and coherent enough to interpret. We mainly restrict attention to the Kullback-Leibler divergence, a widely applied model evaluation measure that quantifies the similarity between the proposed candidate model and the underlying true model, where the true model refers only to a probability distribution that is the best projection onto the statistical modeling space when we try to understand the real but unknown dynamics/mechanism of interest. In addition to reviewing and discussing the advantages and disadvantages of the historically and currently prevailing practical model selection methods in the literature, we propose a series of convenient and useful tools, each designed and applied for a different purpose, that give asymptotically unbiased assessments of how the candidate Bayesian models are favored in terms of predicting a future independent observation. We also explore the connection between the Kullback-Leibler based information criterion and Bayes factors, another of the most popular Bayesian model comparison approaches, after seeing the motivation through the developments of the Bayes factor variants. In general, we expect to provide useful guidance for researchers who are interested in conducting Bayesian data analysis.Statisticssz2020StatisticsDissertationsContributions to Semiparametric Inference to Biased-Sampled and Financial Data
http://academiccommons.columbia.edu/catalog/ac:177018
Sit, Tonyhttp://hdl.handle.net/10022/AC:P:14685Wed, 12 Sep 2012 00:00:00 +0000This thesis develops statistical models and methods for the analysis of life-time and financial data under the umbrella of a semiparametric framework. The first part studies the use of empirical likelihood on Lévy processes that are used to model the dynamics exhibited in financial data. The second part is a study of inferential procedures for survival data collected under various biased sampling schemes in transformation and accelerated failure time models. During the last decade Lévy processes with jumps have gained increasing popularity for modelling market behaviour for both derivative pricing and risk management purposes. Chan et al. (2009) introduced the use of empirical likelihood methods to estimate the parameters of various diffusion processes via their characteristic functions, which are readily available in most cases. Return series from the market are used for estimation. In addition to the return series, there are many derivatives actively traded in the market whose prices also contain information about parameters of the underlying process. This observation motivates us to combine the return series and the associated derivative prices observed in the market so as to provide an estimate that better reflects market movement and to achieve a gain in efficiency. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. We performed simulation and case studies to demonstrate the feasibility and effectiveness of the proposed method. The second part of this thesis investigates a unified estimation method for semiparametric linear transformation models and accelerated failure time models under general biased sampling schemes. The methodology proposed was first investigated in Paik (2009), in which the length-biased case is considered for transformation models. 
The new estimator is obtained from a set of counting process-based unbiased estimating equations, developed through introducing a general weighting scheme that offsets the sampling bias. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. A closed-form formula is derived for the limiting variance and the plug-in estimator is shown to be consistent. We demonstrate the unified approach through the special cases of left truncation, length-bias, the case-cohort design and variants thereof. Simulation studies and applications to real data sets are also presented.Statisticsts2500StatisticsDissertationsDetecting Dependence Change Points in Multivariate Time Series with Applications in Neuroscience and Finance
http://academiccommons.columbia.edu/catalog/ac:177012
Cribben, Ivor Johnhttp://hdl.handle.net/10022/AC:P:14681Wed, 12 Sep 2012 00:00:00 +0000In many applications there are dynamic changes in the dependency structure between multivariate time series. Two examples are neuroscience and finance. The second and third chapters focus on neuroscience and introduce a data-driven technique for partitioning a time course into distinct temporal intervals with different multivariate functional connectivity patterns between a set of brain regions of interest (ROIs). The technique, called Dynamic Connectivity Regression (DCR), detects temporal change points in functional connectivity and estimates a graph, or set of relationships between ROIs, for data in the temporal partition that falls between pairs of change points. Hence, DCR allows for estimation of both the time of change in connectivity and the connectivity graph for each partition, without requiring prior knowledge of the nature of the experimental design. Permutation and bootstrapping methods are used to perform inference on the change points. In the second chapter of this work, we focus on multi-subject data while in the third chapter, we concentrate on single-subject data and extend the DCR methodology in two ways: (i) we alter the algorithm to make it more accurate for individual subject data with a small number of observations and (ii) we perform inference on the edges or connections between brain regions in order to reduce the number of false positives in the graphs. We also discuss a Likelihood Ratio test to compare precision matrices (inverse covariance matrices) across subjects as well as a test across subjects on the single edges or partial correlations in the graph. In the final chapter of this work, we turn to a finance setting. We use the same DCR technique to detect changes in dependency structure in multivariate financial time series for situations where both the placement and number of change points are unknown. 
In this setting, DCR finds the dependence change points and estimates an undirected graph representing the relationship between time series within each interval created by pairs of adjacent change points. A shortcoming of the proposed DCR methodology is the presence of an excessive number of false positive edges in the undirected graphs, especially when the data deviates from normality. Here we address this shortcoming by proposing a procedure for performing inference on the edges, or partial dependencies between time series, that effectively removes false positive edges. We also discuss two robust estimation procedures based on ranks and the tlasso (Finegold and Drton, 2011) technique, which we contrast with the glasso technique used by DCR.Statisticsijc2104StatisticsDissertationsMethods for studying the neural code in high dimensions
http://academiccommons.columbia.edu/catalog/ac:152510
Ramirez, Alexandro D.http://hdl.handle.net/10022/AC:P:14688Wed, 12 Sep 2012 00:00:00 +0000Over the last two decades technological developments in multi-electrode arrays and fluorescence microscopy have made it possible to simultaneously record from hundreds to thousands of neurons. Developing methods for analyzing these data in order to learn how networks of neurons respond to external stimuli and process information is an outstanding challenge for neuroscience. In this dissertation, I address the challenge of developing and testing models that are both flexible and computationally tractable when used with high dimensional data. In chapter 2 I will discuss an approximation to the generalized linear model (GLM) log-likelihood that I developed in collaboration with my thesis advisor. This approximation is designed to ease the computational burden of evaluating GLMs. I will show that our method reduces the computational cost of evaluating the GLM log-likelihood by a factor proportional to the number of parameters in the model times the number of observations. Therefore it is most beneficial in typical neuroscience applications where the number of parameters is large. I then detail a variety of applications where our method can be of use, including Maximum Likelihood estimation of GLM parameters, marginal likelihood calculations for model selection and Markov chain Monte Carlo methods for sampling from posterior parameter distributions. I go on to show that our model does not necessarily sacrifice accuracy for speed. Using both analytic calculations and multi-unit, primate retinal responses, I show that parameter estimates and predictions using our model can have the same accuracy as that of generalized linear models. In chapter 3 I study the neural decoding problem of predicting stimuli from neuronal responses. 
The focus is on reconstructing zebra finch song spectrograms, which are high-dimensional, by combining the spike trains of zebra finch auditory midbrain neurons with information about the correlations present in all zebra finch song. I use a GLM to model neuronal responses and a series of prior distributions, each carrying different amounts of statistical information about zebra finch song. For song reconstruction I make use of recent connections made between the applied mathematics literature on solving linear systems of equations involving matrices with special structure and neural decoding. This allowed me to calculate maximum a posteriori (MAP) estimates of song spectrograms in a time that only grows linearly, and is therefore quite tractable, with the number of time-bins in the song spectrogram. This speed was beneficial for answering questions which required the reconstruction of a variety of song spectrograms each corresponding to different priors made on the distribution of zebra finch song. My collaborators and I found that spike trains from a population of MLd neurons combined with an uncorrelated Gaussian prior can estimate the amplitude envelope of song spectrograms. The same set of responses can be combined with Gaussian priors that have correlations matched to those found across multiple zebra finch songs to yield song spectrograms similar to those presented to the animal. The fidelity of spectrogram reconstructions from MLd responses relies more heavily on prior knowledge of spectral correlations than temporal correlations. However, the best reconstructions combine MLd responses with both spectral and temporal correlations.Neurosciencesadr2110Statistics, Neurobiology and Behavior, NeuroscienceDissertationsModeling Strategies for Large Dimensional Vector Autoregressions
http://academiccommons.columbia.edu/catalog/ac:152472
Zang, Pengfeihttp://hdl.handle.net/10022/AC:P:14666Tue, 11 Sep 2012 00:00:00 +0000The vector autoregressive (VAR) model has been widely used for describing the dynamic behavior of multivariate time series. However, fitting standard VAR models to large dimensional time series is challenging primarily due to the large number of parameters involved. In this thesis, we propose two strategies for fitting large dimensional VAR models. The first strategy involves reducing the number of non-zero entries in the autoregressive (AR) coefficient matrices and the second is a method to reduce the effective dimension of the white noise covariance matrix. We propose a 2-stage approach for fitting large dimensional VAR models where many of the AR coefficients are zero. The first stage provides initial selection of non-zero AR coefficients by taking advantage of the properties of partial spectral coherence (PSC) in conjunction with BIC. The second stage, based on $t$-ratios and BIC, further refines the selection by removing spurious non-zero AR coefficients that remain after the first stage. Our simulation study suggests that the 2-stage approach outperforms Lasso-type methods in discovering sparsity patterns in AR coefficient matrices of VAR models. The performance of our 2-stage approach is also illustrated with three real data examples. Our second strategy for reducing the complexity of a large dimensional VAR model is based on a reduced-rank estimator for the white noise covariance matrix. We first derive the reduced-rank covariance estimator under the setting of independent observations and give the analytical form of its maximum likelihood estimate. Then we describe how to integrate the proposed reduced-rank estimator into the fitting of large dimensional VAR models, where we consider two scenarios that require different model fitting procedures. 
In the VAR modeling context, our reduced-rank covariance estimator not only provides interpretable descriptions of the dependence structure of VAR processes but also leads to improvement in model-fitting and forecasting over unrestricted covariance estimators. Two real data examples are presented to illustrate these fitting procedures.Statisticspz2146StatisticsDissertationsSome Models for Time Series of Counts
http://academiccommons.columbia.edu/catalog/ac:152149
Liu, Henghttp://hdl.handle.net/10022/AC:P:14561Wed, 29 Aug 2012 00:00:00 +0000This thesis focuses on developing nonlinear time series models and establishing relevant theory with a view towards applications in which the responses are integer valued. The discreteness of the observations, which classical time series models do not accommodate, requires novel modeling strategies. The majority of the existing models for time series of counts assume that the observations follow a Poisson distribution conditional on an accompanying intensity process that drives the serial dynamics of the model. According to whether the evolution of the intensity process depends on the observations or solely on an external process, the models are classified as observation-driven or parameter-driven, respectively. Compared to a parameter-driven model, an observation-driven model often allows for easier and more straightforward estimation of the model parameters. On the other hand, the stability properties of the process, such as the existence and uniqueness of a stationary and ergodic solution that are required for deriving asymptotic theory of the parameter estimates, can be quite complicated to establish, as compared to parameter-driven models. In this thesis, we first propose a broad class of observation-driven models that is based upon a one-parameter exponential family of distributions and incorporates nonlinear dynamics. The establishment of stability properties of these processes, which is at the heart of this thesis, is addressed by employing theory from iterated random functions and coupling techniques. Using this theory, we are also able to obtain the asymptotic behavior of maximum likelihood estimates of the parameters. Extensions of the base model in several directions are considered. Inspired by the idea of a self-excited threshold ARMA process, a threshold Poisson autoregression is proposed. 
It introduces a two-regime structure in the intensity process and essentially allows for modeling negatively correlated observations. E-chain, a non-standard Markov chain technique and Lyapunov's method are utilized to show the stationarity and a law of large numbers for this process. In addition, the model has been adapted to incorporate covariates, an important problem of practical and primary interest. The base model is also extended to consider the case of multivariate time series of counts. Given a suitable definition of a multivariate Poisson distribution, a multivariate Poisson autoregression process is described and its properties studied. Several simulation studies are presented to illustrate the inference theory. The proposed models are also applied to several real data sets, including the number of transactions of the Ericsson stock, the return times of Goldman Sachs Group stock prices, the number of road crashes in Schiphol, the frequencies of occurrences of gold particles, the incidences of polio in the US and the number of presentations of asthma in an Australian hospital. An array of graphical and quantitative diagnostic tools, which is specifically designed for the evaluation of goodness of fit for time series of counts models, is described and illustrated with these data sets.Statisticshl2494StatisticsDissertationsStatistical inference in two non-standard regression problems
http://academiccommons.columbia.edu/catalog/ac:151460
Seijo, Emilio Franciscohttp://hdl.handle.net/10022/AC:P:14317Wed, 08 Aug 2012 00:00:00 +0000This thesis analyzes two regression models whose least squares estimators have nonstandard asymptotics. It is divided into an introduction and two parts. The introduction motivates the study of nonstandard problems and presents an outline of the contents of the remaining chapters. In part I, the least squares estimator of a multivariate convex regression function is studied in great detail. The main contribution here is a proof of the consistency of the aforementioned estimator in a completely nonparametric setting. Model misspecification, local rates of convergence and multidimensional regression models mixing convexity and componentwise monotonicity constraints are also considered. Part II deals with change-point regression models and the issues that might arise when applying the bootstrap to these problems. The classical bootstrap is shown to be inconsistent on a simple change-point regression model, and an alternative (smoothed) bootstrap procedure is proposed and proved to be consistent. The superiority of the alternative method is also illustrated through a simulation study. In addition, a version of the continuous mapping theorem specially suited for change-point estimators is proved and used to derive the results concerning the bootstrap.Statistics, Applied mathematics, Mathematicsefs2113StatisticsDissertationsGenus Distributions of Graphs Constructed Through Amalgamations
http://academiccommons.columbia.edu/catalog/ac:146091
Poshni, Mehvish Irfanhttp://hdl.handle.net/10022/AC:P:12989Thu, 12 Apr 2012 00:00:00 +0000Graphs are commonly represented as points in space connected by lines. The points in space are the vertices of the graph, and the lines joining them are the edges of the graph. A general definition of a graph is considered here, where multiple edges are allowed between two vertices and an edge is permitted to connect a vertex to itself. It is assumed that graphs are connected, i.e., any vertex in the graph is reachable from another distinct vertex either directly through an edge connecting them or by a path consisting of intermediate vertices and connecting edges. Under this visual representation, graphs can be drawn on various surfaces. The focus of my research is restricted to a class of surfaces that are characterized as compact connected orientable 2-manifolds. The drawings of graphs on surfaces that are of primary interest follow certain prescribed rules. These are called 2-cellular graph embeddings, or simply embeddings. A well-known closed formula makes it easy to enumerate the total number of 2-cellular embeddings for a given graph over all surfaces. A much harder task is to give a surface-wise breakdown of this number as a sequence of numbers that count the number of 2-cellular embeddings of a graph for each orientable surface. This sequence of numbers for a graph is known as the genus distribution of a graph. Prior research on genus distributions of graphs has primarily focused on making calculations of genus distributions for specific families of graphs. These families of graphs have often been contrived, and the methods used for finding their genus distributions have not been general enough to extend to other graph families. The research I have undertaken aims at developing and using a general method that frames the problem of calculating genus distributions of large graphs in terms of a partitioning of the genus distributions of smaller graphs. 
To this end, I use various operations such as edge-amalgamation, self-edge-amalgamation, and vertex-amalgamation to construct large graphs out of smaller graphs, by coupling their vertices and edges together in certain consistent ways. This method assumes that the partitioned genus distribution of the smaller graphs is known or is easily calculable by computer, for instance, by using the famous Heffter-Edmonds algorithm. As an outcome of the techniques used, I obtain general recurrences and closed formulas that give genus distributions for infinitely many recursively specifiable graph families. I also give an easily understood method for finding non-trivial examples of distinct graphs having the same genus distribution. In addition, I describe an algorithm that computes the genus distributions for a family of graphs known as the 4-regular outerplanar graphs.Computer sciencemp2452Statistics, Computer ScienceDissertationsHigh dimensional information processing
http://academiccommons.columbia.edu/catalog/ac:139263
Rahnama Rad, Kamiarhttp://hdl.handle.net/10022/AC:P:11287Wed, 28 Sep 2011 00:00:00 +0000Part I: Consider the n-dimensional vector $y = X\beta + \epsilon$ where $\beta \in \mathbb{R}^p$ has only k nonzero entries and $\epsilon \in \mathbb{R}^n$ is a Gaussian noise. This can be viewed as a linear system with sparsity constraints corrupted by noise, where the objective is to estimate the sparsity pattern of $\beta$ given the observation vector y and the measurement matrix X. First, we derive a non-asymptotic upper bound on the probability that a specific wrong sparsity pattern is identified by the maximum-likelihood estimator. We find that this probability depends (inversely) exponentially on the difference of $\|X\beta\|_2$ and the $\ell_2$-norm of $X\beta$ projected onto the range of columns of X indexed by the wrong sparsity pattern. Second, when X is randomly drawn from a Gaussian ensemble, we calculate a non-asymptotic upper bound on the probability of the maximum-likelihood decoder not declaring (partially) the true sparsity pattern. Consequently, we obtain sufficient conditions on the sample size n that guarantee almost surely the recovery of the true sparsity pattern. We find that the required growth rate of sample size n matches the growth rate of previously established necessary conditions. Part II: Estimating two-dimensional firing rate maps is a common problem, arising in a number of contexts: the estimation of place fields in hippocampus, the analysis of temporally nonstationary tuning curves in sensory and motor areas, the estimation of firing rates following spike-triggered covariance analyses, etc. Here we introduce methods based on Gaussian process nonparametric Bayesian techniques for estimating these two-dimensional rate maps. 
These techniques offer a number of advantages: the estimates may be computed efficiently, come equipped with natural errorbars, adapt their smoothness automatically to the local density and informativeness of the observed data, and permit direct fitting of the model hyperparameters (e.g., the prior smoothness of the rate map) via maximum marginal likelihood. We illustrate the flexibility and performance of the new techniques on a variety of simulated and real data. Part III: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design. 
Part IV: A model of distributed parameter estimation in networks is introduced, where agents have access to partially informative measurements over time. Each agent faces a local identification problem, in the sense that it cannot consistently estimate the parameter in isolation. We prove that, despite local identification problems, if agents update their estimates recursively as a function of their neighbors’ beliefs, they can consistently estimate the true parameter provided that the communication network is strongly connected; that is, there exists an information path between any two agents in the network. We also show that the estimates of all agents are asymptotically normally distributed. Finally, we compute the asymptotic variance of the agents’ estimates in terms of their observation models and the network topology, and provide conditions under which the distributed estimators are as efficient as any centralized estimator.Neurosciences, Applied mathematicskr2248StatisticsDissertationsSelf-controlled methods for postmarketing drug safety surveillance in large-scale longitudinal data
http://academiccommons.columbia.edu/catalog/ac:137551
Simpson, Shawn E.http://hdl.handle.net/10022/AC:P:10963Mon, 22 Aug 2011 00:00:00 +0000A primary objective in postmarketing drug safety surveillance is to ascertain the relationship between time-varying drug exposures and adverse events (AEs) related to health outcomes. Surveillance can be based on longitudinal observational databases (LODs), which contain time-stamped patient-level medical information including periods of drug exposure and dates of diagnoses. Due to its desirable properties, we focus on the self-controlled case series (SCCS) method for analysis in this context. SCCS implicitly controls for fixed multiplicative baseline covariates since each individual acts as their own control. In addition, only exposed cases are required for the analysis, which is computationally advantageous. In the first part of this work we present how the simple SCCS model can be applied to the surveillance problem, and compare the results of simple SCCS to those of existing methods. Many current surveillance methods are based on marginal associations between drug exposures and AEs. Such analyses ignore confounding drugs and interactions and have the potential to give misleading results. In order to avoid these difficulties, it is desirable for an analysis strategy to incorporate large numbers of time-varying potential confounders such as other drugs. In the second part of this work we propose the Bayesian multiple SCCS approach, which deals with high dimensionality and can provide a sparse solution via a Laplacian prior. We present details of the model and optimization procedure, as well as results of empirical investigations. SCCS is based on a conditional Poisson regression model, which assumes that events at different time points are conditionally independent given the covariate process. This requirement is problematic when the occurrence of an event can alter the future event risk. 
In a clinical setting, for example, patients who have a first myocardial infarction (MI) may be at higher subsequent risk for a second. In the third part of this work we propose the positive dependence self-controlled case series (PD-SCCS) method: a generalization of SCCS that allows the occurrence of an event to increase the future event risk, yet maintains the advantages of the original by controlling for fixed baseline covariates and relying solely on data from cases. We develop the model and compare the results of PD-SCCS and SCCS on example drug-AE pairs.Statisticsses2155StatisticsDissertationsSome Nonparametric Methods for Clinical Trials and High Dimensional Data
http://academiccommons.columbia.edu/catalog/ac:174242
Wu, Xiaoruhttp://hdl.handle.net/10022/AC:P:10335Wed, 11 May 2011 00:00:00 +0000This dissertation addresses two problems from novel perspectives. In chapter 2, I propose an empirical likelihood based method to nonparametrically adjust for baseline covariates in randomized clinical trials and in chapter 3, I develop a survival analysis framework for multivariate K-sample problems. (I): Covariate adjustment is an important tool in the analysis of randomized clinical trials and observational studies. It can be used to increase efficiency and thus power, and to reduce possible bias. While most statistical tests in randomized clinical trials are nonparametric in nature, approaches for covariate adjustment typically rely on specific regression models, such as the linear model for a continuous outcome, the logistic regression model for a dichotomous outcome, and the Cox model for survival time. Several recent efforts have focused on model-free covariate adjustment. This thesis makes use of the empirical likelihood method and proposes a nonparametric approach to covariate adjustment. A major advantage of the new approach is that it automatically utilizes covariate information in an optimal way without fitting a nonparametric regression. The usual asymptotic properties, including the Wilks-type result of convergence to a chi-square distribution for the empirical likelihood ratio based test, and asymptotic normality for the corresponding maximum empirical likelihood estimator, are established. It is also shown that the resulting test is asymptotically most powerful and that the estimator for the treatment effect achieves the semiparametric efficiency bound. The new method is applied to the Global Use of Strategies to Open Occluded Coronary Arteries (GUSTO)-I trial. Extensive simulations are conducted, validating the theoretical findings. This work is not only useful for nonparametric covariate adjustment but also has theoretical value. 
It broadens the scope of the traditional empirical likelihood inference by allowing the number of constraints to grow with the sample size. (II): Motivated by applications in high-dimensional settings, I propose a novel approach to testing equality of two or more populations by constructing a class of intensity centered score processes. The resulting tests are analogous in spirit to the well-known class of weighted log-rank statistics that is widely used in survival analysis. The test statistics are nonparametric, computationally simple and applicable to high-dimensional data. We establish the usual large sample properties by showing that the underlying log-rank score process converges weakly to a Gaussian random field with zero mean under the null hypothesis, and with a drift under the contiguous alternatives. For the Kolmogorov-Smirnov-type and the Cramer-von Mises-type statistics, we also establish the consistency result for any fixed alternative. As a practical means to obtain approximate cutoff points for the test statistics, a simulation based resampling method is proposed, with theoretical justification given by establishing weak convergence for the randomly weighted log-rank score process. The new approach is applied to a study of brain activation measured by functional magnetic resonance imaging when performing two linguistic tasks and also to a prostate cancer DNA microarray data set.Statisticsxw2144StatisticsDissertationsContagion and Systemic Risk in Financial Networks
http://academiccommons.columbia.edu/catalog/ac:131474
Moussa, Amalhttp://hdl.handle.net/10022/AC:P:10249Fri, 29 Apr 2011 00:00:00 +0000The 2007-2009 financial crisis has shed light on the importance of contagion and systemic risk, and revealed the lack of adequate indicators for measuring and monitoring them. This dissertation addresses these issues and leads to several recommendations for the design of an improved assessment of systemic importance, improved rating methods for structured finance securities, and their use by investors and risk managers. Using a complete data set of all mutual exposures and capital levels of financial institutions in Brazil in 2007 and 2008, we explore in chapter 2 the structure and dynamics of the Brazilian financial system. We show that the Brazilian financial system exhibits a complex network structure characterized by a strong degree of heterogeneity in connectivity and exposure sizes across institutions, which is qualitatively and quantitatively similar to the statistical features observed in other financial systems. We find that the Brazilian financial network is well represented by a directed scale-free network, rather than a small world network. Based on these observations, we propose a stochastic model for the structure of banking networks, representing them as a directed weighted scale-free network with power law distributions for the in-degree and out-degree of nodes and a Pareto distribution for exposures. This model may then be used for simulation studies of contagion and systemic risk in networks. We propose in chapter 3 a quantitative methodology for assessing contagion and systemic risk in a network of interlinked institutions. We introduce the Contagion Index as a metric of the systemic importance of a single institution or a set of institutions that combines the effects of both common market shocks to portfolios and contagion through counterparty exposures. 
Using a directed scale-free graph simulation of the financial system, we study the sensitivity of contagion to a change in aggregate network parameters: connectivity, concentration of exposures, heterogeneity in degree distribution and network size. More concentrated and more heterogeneous networks are found to be more resilient to contagion. The impact of connectivity is more controversial: in well-capitalized networks, increasing connectivity improves the resilience to contagion when the initial level of connectivity is high, but increases contagion when the initial level of connectivity is low. In undercapitalized networks, increasing connectivity tends to increase the severity of contagion. We also study the sensitivity of contagion to local measures of connectivity and concentration across counterparties --the counterparty susceptibility and local network frailty-- that are found to have a monotonically increasing relationship with the systemic risk of an institution. Requiring a minimum (aggregate) capital ratio is shown to reduce the systemic impact of defaults of large institutions; we show that the same effect may be achieved with less capital by imposing such capital requirements only on systemically important institutions and those exposed to them. In chapter 4, we apply this methodology to the study of the Brazilian financial system. Using the Contagion Index, we study the potential for default contagion and systemic risk in the Brazilian system and analyze the contribution of balance sheet size and network structure to systemic risk. Our study reveals that, aside from balance sheet size, the network-based local measures of connectivity and concentration of exposures across counterparties introduced in chapter 3, the counterparty susceptibility and local network frailty, contribute significantly to the systemic importance of an institution in the Brazilian network. Thus, imposing an upper bound on these variables could help reduce contagion. 
We examine the impact of various capital requirements on the extent of contagion in the Brazilian financial system, and show that targeted capital requirements achieve the same reduction in systemic risk with lower requirements in capital for financial institutions. The methodology we proposed in chapter 3 for estimating contagion and systemic risk requires visibility on the entire network structure. Reconstructing bilateral exposures from balance sheet data is then a question of interest in a financial system where bilateral exposures are not disclosed. We propose in chapter 5 two methods to derive a distribution of bilateral exposures matrices. The first method attempts to recover the balance sheet assets and liabilities "sample by sample". Each sample of the bilateral exposures matrix is the solution of a relative entropy minimization problem subject to the balance sheet constraints. However, a solution to this problem does not always exist when dealing with sparse sample matrices. Thus, we propose a second method that attempts to recover the assets and liabilities "in the mean". This approach is the analogue of the Weighted Monte Carlo method introduced by Avellaneda et al. (2001). We first simulate independent samples of the bilateral exposures matrix from a relevant prior distribution on the network structure, then we compute posterior probabilities by maximizing the entropy under the constraints that the balance sheet assets and liabilities are recovered in the mean. We discuss the pros and cons of each approach and explain how it could be used to detect systemically important institutions in the financial system. The recent crisis has also raised many questions regarding the meaning of structured finance credit ratings issued by rating agencies and the methodology behind them. 
Chapter 6 aims at clarifying some misconceptions related to structured finance ratings and how they are commonly interpreted: we discuss the comparability of structured finance ratings with bond ratings, the interaction between the rating procedure and the tranching procedure and its consequences for the stability of structured finance ratings in time. These insights are illustrated in a factor model by simulating rating transitions for CDO tranches using a nested Monte Carlo method. In particular, we show that the downgrade risk of a CDO tranche can be quite different from that of a bond with the same initial rating. Structured finance ratings follow path-dependent dynamics that cannot be adequately described, as usually done, by a matrix of transition probabilities. Therefore, a simple labeling via default probability or expected loss does not sufficiently discriminate their downgrade risk. We propose to supplement ratings with indicators of downgrade risk. To overcome some of the drawbacks of existing rating methods, we suggest a risk-based rating procedure for structured products. Finally, we formulate a series of recommendations regarding the use of credit ratings for CDOs and other structured credit instruments.Finance, Statisticsam2810Industrial Engineering and Operations Research, StatisticsDissertationsOptimal Trading Strategies Under Arbitrage
http://academiccommons.columbia.edu/catalog/ac:131477
Ruf, Johannes Karl Dominikhttp://hdl.handle.net/10022/AC:P:10250Fri, 29 Apr 2011 00:00:00 +0000This thesis analyzes models of financial markets that incorporate the possibility of arbitrage opportunities. The first part demonstrates how explicit formulas for optimal trading strategies in terms of minimal required initial capital can be derived in order to replicate a given terminal wealth in a continuous-time Markovian context. Towards this end, only the existence of a square-integrable market price of risk (rather than the existence of an equivalent local martingale measure) is assumed. A new measure under which the dynamics of the stock price processes simplify is constructed. It is shown that delta hedging does not depend on the "no free lunch with vanishing risk" assumption. However, in the presence of arbitrage opportunities, finding an optimal strategy is directly linked to the non-uniqueness of the partial differential equation corresponding to the Black-Scholes equation. In order to apply these analytic tools, sufficient conditions are derived for the necessary differentiability of expectations indexed over the initial market configuration. The phenomenon of "bubbles," which has been a popular topic in the recent academic literature, appears as a special case of the setting in the first part of this thesis. Several examples at the end of the first part illustrate the techniques contained therein. In the second part, a more general point of view is taken. The stock price processes, which again allow for the possibility of arbitrage, are no longer assumed to be Markovian, but rather only Itô processes. We then prove the Second Fundamental Theorem of Asset Pricing for these markets: A market is complete, meaning that any bounded contingent claim is replicable, if and only if the stochastic discount factor is unique. Conditions under which a contingent claim can be perfectly replicated in an incomplete market are established. 
Then, precise conditions under which relative arbitrage and strong relative arbitrage with respect to a given trading strategy exist are explicated. In addition, it is shown that if the market is quasi-complete, meaning that any bounded contingent claim measurable with respect to the stock price filtration is replicable, relative arbitrage implies strong relative arbitrage. It is further demonstrated that markets are quasi-complete, subject to the condition that the drift and diffusion coefficients are measurable with respect to the stock price filtration.Mathematics, Financejkr2115Statistics, MathematicsDissertationsTwo Approaches to Non-Zero-Sum Stochastic Differential Games of Control and Stopping
http://academiccommons.columbia.edu/catalog/ac:131462
Li, Qinghuahttp://hdl.handle.net/10022/AC:P:10245Fri, 29 Apr 2011 00:00:00 +0000This dissertation takes two approaches - martingale and backward stochastic differential equation (BSDE) - to solve non-zero-sum stochastic differential games in which all players can control and stop the reward streams of the games. Existence of equilibrium stopping rules is proved under some assumptions. The martingale part provides an equivalent martingale characterization of Nash equilibrium strategies of the games. When using equilibrium stopping rules, Isaacs' condition is necessary and sufficient for the existence of an equilibrium control set. The BSDE part shows that solutions to BSDEs provide value processes of the games. A multidimensional BSDE with reflecting barrier is studied in two cases for its solution: existence and uniqueness with Lipschitz growth, and existence in a Markovian system with linear growth rate.Mathematicsql2133Statistics, MathematicsDissertationsStatistical methods for indirectly observed network data
http://academiccommons.columbia.edu/catalog/ac:131447
McCormick, Tyler H.http://hdl.handle.net/10022/AC:P:10239Fri, 29 Apr 2011 00:00:00 +0000Social networks have become an increasingly common framework for understanding and explaining social phenomena. Yet, despite an abundance of sophisticated models, social network research has yet to realize its full potential, in part because of the difficulty of collecting social network data. In many cases, particularly in the social sciences, collecting complete network data is logistically and financially challenging. In contrast, Aggregated Relational Data (ARD) measure network structure indirectly by asking respondents how many connections they have with members of a certain subpopulation (e.g. How many individuals with HIV/AIDS do you know?). These data require no special sampling procedure and are easily incorporated into existing surveys. This research develops a latent space model for ARD. This dissertation proposes statistical methods for estimating social network and population characteristics using one type of social network data collected using standard surveys. First, a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population is proposed. A second method estimates the demographic characteristics of hard-to-reach groups, or latent demographic profiles. These groups, such as those with HIV/AIDS, unlawful immigrants, or the homeless, are often excluded from the sampling frame of standard social science surveys. A third method develops a latent space model for ARD. This method is similar in spirit to previous latent space models for networks (see Hoff, Raftery and Handcock (2002), for example) in that the dependence structure of the network is represented parsimoniously in a multidimensional geometric space. 
The key distinction from the complete network case is that instead of conditioning on the (latent) distance between two members of the network, the latent space model for ARD conditions on the expected distance between a survey respondent and the center of a subpopulation in the latent space. A spherical latent space facilitates tractable computation of this expectation. This model estimates relative homogeneity between groups in the population and variation in the propensity for interaction between respondents and group members.Statisticsthm2105StatisticsDissertations