2021 Theses Doctoral

# Topics in Bayesian Design and Analysis for Sampling

Survey sampling is an old field, but it is changing with recent advances in statistics and data science. Modern statistical techniques provide new tools for solving old problems in potentially better ways, and new problems arise as data with complex and rich information become more widely available. This dissertation consists of three parts: the first solves an old problem with new tools, the second addresses a new problem in a data-rich setting, and the third takes a design perspective. All three parts deal with modeling survey data and auxiliary information using flexible Bayesian models.

In the first part, we consider Bayesian model-based inference for skewed survey data. Skewed data are common in sample surveys. Using probability-proportional-to-size sampling as an example, where the values of a size variable are known for the population units, we propose two Bayesian model-based predictive methods for estimating finite population quantiles with skewed sample survey data. We assume that the survey outcome follows a skew-normal distribution given the probability of selection, and we model the location and scale parameters of the skew-normal distribution as functions of the probability of selection. To allow a flexible association between the survey outcome and the probability of selection, the first method models the location parameter with a penalized spline and the scale parameter with a polynomial function, while the second method models both the location and scale parameters with penalized splines. Using a fully Bayesian approach, we obtain the posterior predictive distributions of the non-sampled units in the population, and thus the posterior distributions of the finite population quantiles. We show through simulations that, in estimating finite population quantiles, our proposed methods are more efficient than the conventional weighted method and yield shorter credible intervals with better coverage rates. We demonstrate the application of our proposed methods using data from the 2013 National Drug Abuse Treatment System Survey.
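For orientation, the conventional weighted method used as the comparison baseline can be sketched as follows. This is a minimal illustration, not code from the dissertation: it assumes the baseline is the usual inverse-probability-weighted empirical CDF estimator, and the function name `weighted_quantile` and its arguments are hypothetical.

```python
def weighted_quantile(y, pi, q):
    """Design-weighted quantile estimate (illustrative sketch).

    y  : sampled outcome values
    pi : selection probabilities of the sampled units (assumed known)
    q  : quantile level, 0 < q <= 1

    Returns the smallest sampled value whose weighted empirical CDF,
    built with weights 1 / pi, reaches q.
    """
    if len(y) != len(pi) or len(y) == 0:
        raise ValueError("y and pi must be non-empty and of equal length")
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")

    # Sort units by outcome, carrying each unit's sampling weight along.
    pairs = sorted((value, 1.0 / p) for value, p in zip(y, pi))
    total = sum(w for _, w in pairs)

    cumulative = 0.0
    for value, w in pairs:
        cumulative += w
        if cumulative >= q * total:
            return value
    return pairs[-1][0]  # guard against floating-point shortfall
```

Units with small selection probabilities receive large weights, so a few such units can dominate the estimate. That sensitivity is one common motivation for model-based alternatives such as those proposed here.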

In the second part, we consider inference from non-random samples in data-rich settings where high-dimensional auxiliary information is available both in the sample and the target population, with survey inference being a special case. We propose a regularized prediction approach that predicts the outcomes in the population using a large number of auxiliary variables, so that the ignorability assumption is reasonable, while the Bayesian framework makes the quantification of uncertainty straightforward. Inspired by Little and An (2004), we also extend the approach by estimating the propensity score for a unit to be included in the sample and including it, alongside the auxiliary variables, as a predictor in the machine learning models. We show through simulation studies that the regularized predictions using soft Bayesian additive regression trees (SBART) yield valid inference for the population means and coverage rates close to the nominal levels. We demonstrate the proposed methods in two real-data applications, one from a survey and one from an epidemiology study.

In the third part, we consider survey design for multilevel regression and post-stratification (MRP), a survey adjustment technique that corrects the known discrepancy between sample and population using shared auxiliary variables. MRP has been widely applied in survey analysis, for both probability and non-probability samples. However, literature on survey design for MRP is scarce. We propose a closed-form formula to calculate theoretical margins of error (MOEs) for various estimands based on the variance parameters in the multilevel regression model and the sample sizes in the post-strata. We validate the theoretical MOEs by comparing them with the empirical MOEs in simulation studies covering various sample allocation plans. The validation indicates that the theoretical MOEs based on the formula align with the empirical results for various estimands. We demonstrate the application of the sample size calculation formula in two different survey design scenarios: online panels that use quota sampling and telephone surveys with fixed total sample sizes.
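The two ingredients of MRP can be sketched in a deliberately simplified form. This is not the dissertation's model or its MOE formula: it assumes a normal-normal model with a known overall mean `mu`, a known within-cell variance `sigma2`, and a known between-cell variance `tau2`. All function and parameter names are hypothetical.

```python
def partial_pool(ybar, n, mu, sigma2, tau2):
    """Shrink a cell's sample mean toward the overall mean.

    Precision-weighted average under a normal-normal model with known
    variances. Cells with small n are pulled more strongly toward mu.
    """
    if n == 0:
        return mu  # no data in the cell: fall back on the overall mean
    data_precision = n / sigma2
    prior_precision = 1.0 / tau2
    return (data_precision * ybar + prior_precision * mu) / (
        data_precision + prior_precision
    )


def poststratify(cell_estimates, cell_pop_sizes):
    """Combine cell-level estimates using known population cell sizes."""
    if len(cell_estimates) != len(cell_pop_sizes):
        raise ValueError("one population size is needed per cell")
    total = sum(cell_pop_sizes)
    if total <= 0:
        raise ValueError("population sizes must sum to a positive number")
    weighted_sum = sum(
        size * est for est, size in zip(cell_estimates, cell_pop_sizes)
    )
    return weighted_sum / total
```

In this simplified setting, the precision of each cell estimate depends on the cell sample size and the two variance parameters. This matches the quantities the abstract names as inputs to the theoretical MOEs, although the actual formula is given only in the dissertation itself.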

## Files

- Liu_columbia_0054D_16378.pdf (application/pdf, 3.9 MB)

## More About This Work

- Academic Units: Biostatistics
- Thesis Advisors: Chen, Qixuan; Gelman, Andrew E.
- Degree: Ph.D., Columbia University
- Published Here: February 22, 2021