Doctoral Thesis

Towards Human-Level Speech Synthesis: Bridging Generative Models and Computational Neuroscience

Li, Yinghao

Speech synthesis technology has made significant strides in recent years, but achieving truly human-like speech remains a fundamental challenge. This dissertation presents a unified framework for decomposing speech into interpretable components, namely text content and speaking style, while preserving natural prosody and speaker characteristics. Through a series of innovative models and techniques, I demonstrate how this disentangled representation approach, informed by principles of human speech processing, can generate speech that is more natural, expressive, and perceptually aligned with human listeners, advancing one step further toward human-level speech synthesis.

The thesis traces the evolution of style-based speech generation, from the foundations laid by StarGANv2-VC for unsupervised voice conversion to the development of StyleTTS, the first text-to-speech system to achieve human-level naturalness on public benchmark datasets. Throughout the chapters, I introduce various advanced techniques, such as transferable monotonic aligners for improved prosody, adversarial training with large speech language models, and direct optimization of perceptually relevant metrics.

A central theme that emerges is the convergence between successful AI architectures and patterns of human neural processing. Analysis of intracranial recordings and language model representations reveals that modern AI models naturally develop hierarchical feature extraction pathways for contextual information that mirror those in the auditory cortex. Leveraging this insight, innovations such as phoneme-level self-supervised pre-training (PL-BERT) and time-varying style modeling (StyleTTS-ZS) demonstrate the benefits of incorporating neural-inspired processing principles. The culmination of this research is Style-Talker, an end-to-end speech-to-speech interaction framework that seamlessly integrates text understanding, style generation, and neural-aligned fine-tuning. By teaching language models to jointly generate what to say and how to say it, Style-Talker achieves highly efficient dialogue generation with context-appropriate prosody and emotional delivery.

The technical contributions of this dissertation include: (1) an unsupervised voice conversion framework that generalizes across languages and speaking styles, (2) the first text-to-speech system to achieve human-level performance on public benchmark datasets through style-based generation, (3) an efficient zero-shot speaker adaptation technique using distilled diffusion models, and (4) empirical evidence linking successful generative model architectures to human cognitive processing mechanisms. My models achieve state-of-the-art performance across multiple tasks while maintaining computational efficiency, as demonstrated through extensive evaluations on standard benchmarks.

Beyond advancing the state-of-the-art, this dissertation establishes a new paradigm for speech synthesis research at the intersection of generative modeling, computational neuroscience, and human perception. The models developed not only demonstrate superior performance and efficiency, but also illuminate fundamental principles for bridging artificial and biological speech processing. In laying this groundwork, the thesis opens pathways toward next-generation speech technologies that think, understand, and communicate in increasingly human-like ways.

Files

  • Li_columbia_0054D_19049.pdf (application/pdf, 21.2 KB)

More About This Work

Academic Units
Electrical Engineering
Thesis Advisors
Mesgarani, Nima
Degree
Ph.D., Columbia University
Published Here
April 16, 2025