Academic Commons

Theses Doctoral

Advances in Model Selection Techniques with Applications to Statistical Network Analysis and Recommender Systems

Franco Saldana, Diego

This dissertation focuses on developing novel model selection techniques, the process by which a statistician selects one of a number of competing models of varying dimensions, under an array of different statistical assumptions on observed data. Traditionally, two main reasons have been advocated by researchers for performing model selection strategies over classical maximum likelihood estimates (MLEs). The first reason is prediction accuracy, where by shrinking or setting to zero some model parameters, one sacrifices the unbiasedness of MLEs for a reduced variance, which in turn leads to an overall improvement in predictive performance. The second reason relates to interpretability of the selected models in the presence of a large number of predictors, where in order to obtain a parsimonious representation exhibiting the relationship between the response and covariates, we are willing to sacrifice some of the smaller details brought in by spurious predictors.
In the first part of this work, we revisit the family of variable selection techniques known as sure independence screening procedures for generalized linear models and the Cox proportional hazards model. After clever combination of some of its most powerful variants, we propose new extensions based on the idea of sample splitting, data-driven thresholding, and combinations thereof. A publicly available package developed in the R statistical software demonstrates considerable improvements in terms of model selection and competitive computational time between our enhanced variable selection procedures and traditional penalized likelihood methods applied directly to the full set of covariates.
Next, we develop model selection techniques within the framework of statistical network analysis for two frequent problems arising in the context of stochastic blockmodels: community number selection and change-point detection. In the second part of this work, we propose a composite likelihood based approach for selecting the number of communities in stochastic blockmodels and its variants, with robustness consideration against possible misspecifications in the underlying conditional independence assumptions of the stochastic blockmodel. Several simulation studies, as well as two real data examples, demonstrate the superiority of our composite likelihood approach when compared to the traditional Bayesian Information Criterion or variational Bayes solutions. In the third part of this thesis, we extend our analysis on static network data to the case of dynamic stochastic blockmodels, where our model selection task is the segmentation of a time-varying network into temporal and spatial components by means of a change-point detection hypothesis testing problem. We propose a corresponding test statistic based on the idea of data aggregation across the different temporal layers through kernel-weighted adjacency matrices computed before and after each candidate change-point, and illustrate our approach on synthetic data and the Enron email corpus.
The matrix completion problem consists in the recovery of a low-rank data matrix based on a small sampling of its entries. In the final part of this dissertation, we extend prior work on nuclear norm regularization methods for matrix completion by incorporating a continuum of penalty functions between the convex nuclear norm and nonconvex rank functions. We propose an algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm-starts, and present a systematic study of the resulting spectral thresholding operators. We demonstrate that our proposed nonconvex regularization framework leads to improved model selection properties in terms of finding low-rank solutions with better predictive performance on a wide range of synthetic data and the famous Netflix data recommender system.

Files

  • thumnail for FrancoSaldana_columbia_0054D_13297.pdf FrancoSaldana_columbia_0054D_13297.pdf binary/octet-stream 4.74 MB Download File

More About This Work

Academic Units
Statistics
Thesis Advisors
Feng, Yang
Degree
Ph.D., Columbia University
Published Here
April 29, 2016
Academic Commons provides global access to research and scholarship produced at Columbia University, Barnard College, Teachers College, Union Theological Seminary and Jewish Theological Seminary. Academic Commons is managed by the Columbia University Libraries.