Theses Doctoral

Statistical analysis of large scale data with perturbation subsampling

Yao, Yujing

The past two decades have witnessed rapid growth in the amount of data available to us. Many fields, including physics, biology, and medical studies, generate enormous datasets with a large sample size, a high number of dimensions, or both. For example, some datasets in physics contain millions of records. Statista Survey forecasts that in 2022 there will be over 86 million users of health apps in the United States, which will generate massive mHealth data. In addition, more and more large studies have been carried out, such as the UK Biobank study. This gives us unprecedented access to data and allows us to extract and infer vital information. Meanwhile, it also poses new challenges for statistical methodologies and computational algorithms.

For increasingly large datasets, computation can be a major hurdle for valid analysis. Conventional statistical methods lack the scalability to handle such large sample sizes. In addition, data storage and processing might be beyond the capacity of an ordinary computer. For example, the UK Biobank genotype and phenotype dataset contains about 500,000 individuals and more than 800,000 genotyped single nucleotide polymorphism (SNP) measurements per person, the size of which may well exceed a computer's physical memory. Further, high dimensionality combined with a large sample size could lead to heavy computational cost and algorithmic instability.

The aim of this dissertation is to provide statistical approaches that address these issues. Chapter 1 reviews the existing literature. In Chapter 2, a novel perturbation subsampling approach based on independent and identically distributed stochastic weights is developed for the analysis of large scale data. The method is justified in the setting of optimizing convex criterion functions by establishing asymptotic consistency and normality of the resulting estimators. It provides a consistent point estimator and a consistent variance estimator simultaneously, and it is also feasible in a distributed framework. The finite sample performance of the proposed method is examined through simulation studies and real data analysis.
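The general idea of fitting a convex criterion with random weights on a subsample can be sketched as follows. This is only a minimal illustration, not the estimator defined in the dissertation: the abstract specifies only that the weights are independent and identically distributed, so the Bernoulli inclusion rate `q`, the Exp(1) multipliers, the squared-error criterion, and the function name `perturbation_subsample_ols` are all assumptions made here for illustration.

```python
import numpy as np


def perturbation_subsample_ols(X, y, q=0.05, n_rep=20, seed=None):
    """Illustrative weighted least squares on randomly perturbed subsamples.

    Assumed weight scheme (not taken from the source): each observation gets
    an i.i.d. weight that is 0 with probability 1 - q and an Exp(1) draw
    otherwise, so only about q * n rows enter each fit.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    estimates = np.empty((n_rep, p))
    for r in range(n_rep):
        # Rows with a nonzero weight form the subsample for this replicate.
        idx = np.flatnonzero(rng.random(n) < q)
        w = rng.exponential(1.0, size=idx.size)
        Xs, ys = X[idx], y[idx]
        # Minimize sum_i w_i * (y_i - x_i' beta)^2 via the normal equations.
        XtW = Xs.T * w
        estimates[r] = np.linalg.solve(XtW @ Xs, XtW @ ys)
    # Average over replicates; the replicates are returned for inspection.
    return estimates.mean(axis=0), estimates
```

Each replicate touches only a fraction of the rows, so the replicates could in principle be computed on separate workers. How the dissertation combines replicates into a variance estimator is not stated in the abstract, so the sketch returns the raw replicate estimates rather than guessing at that formula.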

In Chapter 3, a repeated block perturbation subsampling method is developed for the analysis of large scale longitudinal data using the generalized estimating equation (GEE) approach. The GEE approach is a general method for analyzing longitudinal data by fitting marginal models. The proposed method provides a consistent point estimator and a consistent variance estimator simultaneously. The asymptotic properties of the resulting subsample estimators are also studied. The finite sample performance of the proposed method is evaluated through simulation studies and mHealth data analysis.

With the development of technology, large scale high dimensional data are also increasingly prevalent. Conventional statistical methods for high dimensional data, such as the adaptive lasso (AL), lack the scalability to handle such large sample sizes. Chapter 4 introduces the repeated perturbation subsampling adaptive lasso (RPAL), a new procedure that incorporates features of both perturbation and subsampling to yield a robust, computationally efficient estimator for variable selection, statistical inference, and finite sample false discovery control in the analysis of big data. RPAL is well suited to modern parallel and distributed computing architectures while retaining generic applicability and statistical efficiency. The theoretical properties of RPAL are studied, and simulation studies compare the proposed estimator with the full data estimator and traditional subsampling estimators. The proposed method is also illustrated with the analysis of omics datasets.



More About This Work

Academic Units
Thesis Advisors
Jin, Zhezhen
Ph.D., Columbia University
Published Here
June 8, 2022