2025 Theses Doctoral
Large-scale model training: Dataset construction, reliable scaling, and task-specific adaptation
In recent years, machine learning models have evolved from academic curiosities into widely adopted mainstream tools. This dissertation examines the large-scale training paradigm that enabled this transformation. We first develop an experimental framework for studying dataset curation techniques at internet scale, demonstrating its value by selecting data more intentionally to train state-of-the-art multimodal models.
Next, we investigate model scaling in language models, revealing that performance, including on downstream tasks, is highly predictable from small-scale experiments, enabling informed tradeoffs between parameter count, training duration, and inference cost. Finally, we develop techniques for improving domain-specific model performance while preserving model generality, including an approach for adapting pre-trained models to zero-shot navigation and a parameter-efficient method for targeted model improvements. Our findings signal promising directions for advancing the science and methodology of large-scale training.
Files
- Gadre_columbia_0054D_19394.pdf (application/pdf, 6.5 MB)
More About This Work
- Academic Units: Computer Science
- Thesis Advisors: Song, Shuran
- Degree: Ph.D., Columbia University
- Published Here: September 3, 2025