Theses Doctoral

Harmonizing Audio and Human Interaction: Enhancement, Analysis, and Application of Audio Signals via Machine Learning Approaches

Xu, Ruilin

In this thesis, we tackle key challenges in processing audio signals, specifically focusing on speech and music. These signals are crucial for human interaction with both the environment and machines. Our research addresses three core topics: speech denoising, speech dereverberation, and music-dance generation, each of which plays a vital role in enhancing the harmony between audio and human interaction.

Leveraging machine learning and human-centric approaches inspired by classical algorithms, we develop methods to mitigate common audio degradations, such as additive noise and multiplicative reverberation, delivering high-quality audio suitable for human use and applications. Furthermore, we introduce a real-time, music-responsive system for generating 3D dance animations, advancing the integration of audio signals with human engagement.

The first part of our thesis addresses the removal of additive noise from audio signals by exploiting short pauses, or silent intervals, in human speech. These brief pauses reveal the noise profile, enabling our model to dynamically remove ambient noise from speech. Tested across diverse datasets, our method outperforms traditional and audiovisual denoising techniques and generalizes across different languages and even musical contexts.
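The intuition that pauses expose the noise profile can be illustrated with a small, purely classical sketch. It is not the thesis's learned model: it assumes, for illustration only, that the quietest frames of a recording are silent intervals and averages their energy to estimate the noise power. The function names, frame length, quantile, and synthetic signal are all hypothetical choices.

```python
import math
import random


def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def estimate_noise_power(signal, frame_len=160, quantile=0.2):
    """Average the energy of the quietest frames.

    Assumption (illustrative): the quietest `quantile` fraction of frames
    are silent intervals that contain noise only.
    """
    energies = sorted(frame_energies(signal, frame_len))
    k = max(1, int(len(energies) * quantile))
    return sum(energies[:k]) / k


# Synthetic stand-in for speech: tone bursts alternating with pauses,
# plus additive Gaussian noise of known standard deviation.
random.seed(0)
sample_rate = 8000
noise_std = 0.05
noisy = []
for n in range(sample_rate * 2):
    t = n / sample_rate
    speaking = int(t / 0.25) % 2 == 0  # 0.25 s of "speech", then 0.25 s pause
    clean = 0.5 * math.sin(2 * math.pi * 220 * t) if speaking else 0.0
    noisy.append(clean + random.gauss(0.0, noise_std))

noise_power = estimate_noise_power(noisy)
print("estimated noise power:", noise_power)
print("true noise power:     ", noise_std ** 2)
```

Picking the quietest frames biases the estimate slightly low, which is one reason a fixed energy threshold is only a rough stand-in for a learned silent-interval detector.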

In the second work, we address reverberation removal from audio signals, a task that traditionally relies on knowing the environment's exact impulse response, a requirement that is often impractical in real-world settings. Our solution combines the strengths of classical and learning-based approaches and is tailored to online communication contexts. This human-centric method includes a one-time personalization step that adapts it to specific environments and human speakers. The two-stage model, which integrates feature-based Wiener deconvolution and network refinement, has been shown through extensive experiments to outperform current methods in both effectiveness and user preference.
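For readers unfamiliar with the classical building block named here, the following is a minimal sketch of textbook Wiener deconvolution, assuming the impulse response is known exactly (the very requirement the thesis aims to relax). It does not reproduce the feature-based variant or the network refinement stage. The helper names, the toy impulse response, and the noise-to-signal constant are illustrative assumptions.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (kept dependency-free)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve(x, h):
    """Circular convolution, used here to simulate a reverberant signal."""
    n = len(x)
    return [sum(h[j] * x[(i - j) % n] for j in range(len(h))) for i in range(n)]


def wiener_deconvolve(observed, impulse_response, noise_to_signal=1e-3):
    """Estimate the dry signal as conj(H) * Y / (|H|^2 + NSR) per frequency bin."""
    n = len(observed)
    h = list(impulse_response) + [0.0] * (n - len(impulse_response))
    Y = dft(observed)
    H = dft(h)
    X = [
        Hk.conjugate() * Yk / (abs(Hk) ** 2 + noise_to_signal)
        for Hk, Yk in zip(H, Y)
    ]
    return [v.real for v in dft(X, inverse=True)]


# Toy example: a dry two-tone signal and a short, hypothetical impulse response.
n = 64
dry = [
    math.sin(2 * math.pi * 3 * i / n) + 0.5 * math.sin(2 * math.pi * 7 * i / n)
    for i in range(n)
]
impulse_response = [1.0, 0.0, 0.0, 0.5, 0.0, 0.25]  # direct path plus two echoes
wet = circular_convolve(dry, impulse_response)

recovered = wiener_deconvolve(wet, impulse_response, noise_to_signal=1e-6)
max_error = max(abs(r - d) for r, d in zip(recovered, dry))
print("max reconstruction error:", max_error)
```

The `noise_to_signal` term keeps the division stable at frequencies where the impulse response has little energy; with real, noisy recordings it would need to be much larger than in this noise-free toy case.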

Transitioning from foundational audio signal enhancement and analysis to a more dynamic realm, our research culminates in a novel, interactive system for real-time 3D human dance generation. In contrast to the passive human-centric assumptions of our previous works, this final work actively engages users, enabling direct interaction with a system that synchronizes expressive dance movements to live music across musical elements such as type, tempo, and energy. Diverging from traditional choreography methods, this approach leverages spontaneous improvisation to generate unique dance sequences. These sequences, a mix of pre-recorded choreographies and algorithm-generated transitions, adapt to real-time audio inputs and can be customized through personal 3D avatars. User studies validate the system's user-centric design and interactivity, confirming its effectiveness in creating an immersive and engaging user experience.

Files

  • Xu_columbia_0054D_18779.pdf (application/pdf, 2.2 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Nayar, Shree K.
Degree
Ph.D., Columbia University
Published Here
September 25, 2024