Doctoral Theses

Large-scale model training: Dataset construction, reliable scaling, and task-specific adaptation

Gadre, Samir Yitzhak Arun

In recent years, machine learning models have evolved from academic curiosities into widely adopted mainstream tools. This dissertation examines the large-scale training paradigm that enabled this transformation. We first develop an experimental framework for studying dataset curation techniques at internet scale, demonstrating its value by training state-of-the-art multimodal models through more intentional data selection.

Next, we investigate scaling in language models, revealing that performance, including on downstream tasks, is highly predictable from small-scale experiments, which enables informed tradeoffs between parameter count, training duration, and inference cost. Finally, we develop techniques for improving domain-specific model performance while preserving model generality, including an approach that adapts pre-trained models for zero-shot navigation and a parameter-efficient method for targeted model improvements. Our findings signal promising directions for advancing the science and methodology of large-scale training.

Files

  • Gadre_columbia_0054D_19394.pdf (PDF, 6.5 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Song, Shuran
Degree
Ph.D., Columbia University
Published Here
September 3, 2025