Doctoral Thesis

Analyzing and Securing Software via Robust and Generalizable Learning

Pei, Kexin

Software permeates every facet of our lives, making them more convenient and efficient, and its sphere of influence continues to expand into novel applications and services. However, as software grows in complexity, it exposes an ever-larger attack surface within an intricate landscape of security threats. Program analysis is a pivotal technique for constructing software that is secure, reliable, and efficient. Existing methodologies, however, rely predominantly on rules and heuristics, which require substantial manual tuning to accommodate the diverse components of software.

In this dissertation, I present our advances in data-driven program analysis, a novel approach that employs machine learning to understand both the structures and behaviors of programs, thereby improving the analysis and security of software applications. Beyond traditional software, I also describe our work on the systematic testing and formal verification of learned software components, including neural networks.

I begin with a series of studies centered on the ambitious goal of learning execution-aware program representations, achieved by training large language models to understand program execution semantics. I show that models equipped with execution-aware pre-training attain state-of-the-art results on a range of program analysis tasks, including detecting semantically similar code, type inference, memory dependence analysis, debugging-symbol recovery, and invariant generation. I then describe our approach to learning program structures and dependencies for disassembly and function boundary recovery, which are building blocks for downstream reverse engineering and binary analysis tasks.

In the final part of this dissertation, I present DeepXplore, the first white-box testing framework for deep learning systems, and VeriVis, a pioneering verification framework capable of proving robustness guarantees of neural networks with only black-box access, extending beyond norm-bounded input transformations.


  • Pei_columbia_0054D_17951.pdf (application/pdf, 4.49 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Jana, Suman
Yang, Junfeng
Ph.D., Columbia University
Published Here
August 2, 2023