2025 Theses Doctoral
Transparent Deployment of Machine Learning Models on Many-Accelerator Architectures
The growing demand for machine learning (ML) and signal processing in embedded and edge systems introduces challenges in balancing performance, energy efficiency, and software development simplicity. Heterogeneous Systems-on-Chip (SoCs), which combine general-purpose processors with specialized accelerators, offer a promising solution. However, deploying applications efficiently on such architectures requires new hardware/software co-design methodologies, integration strategies, and runtime resource management. This dissertation presents an approach that separates application logic from accelerator integration, enabling more efficient deployment of ML and signal processing workloads on many-accelerator SoCs.
To begin, I introduce ESP4ML, an open-source design flow that combines ESP, a modular SoC platform, with hls4ml, a high-level synthesis tool for generating ML accelerators. ESP4ML provides an embedded runtime software application programming interface (API) for managing accelerators dynamically within Linux. It also enables direct data transfers between accelerators, reducing memory accesses. These features allow rapid prototyping of SoCs and are demonstrated through FPGA-based implementations executing end-to-end embedded workloads.
Following that, I analyze communication models by comparing memory-based and point-to-point (p2p) data transfer mechanisms for accelerators. Using synthetic and real-world benchmarks, such as Nightvision and Wide-Area Motion Imagery (WAMI), I show that p2p communication consistently delivers better performance and energy efficiency, particularly in multi-threaded and tile-based accelerator systems.
Next, a co-design strategy is presented that integrates the Eigen C++ linear algebra library with ESP. This enables high-level software to transparently offload computations to accelerators, achieving significant gains in both performance and energy efficiency without compromising software simplicity.
To further improve the transparency of deploying ML workloads, I introduce WOLT, a lightweight software layer that enables hardware acceleration for TensorFlow Lite (TFLite) workloads. By leveraging TFLite’s delegate interface, WOLT routes supported operations to accelerators without modifying application code. In addition, a built-in resource manager supports conflict-free, multi-tenant execution of ML applications. FPGA-based experiments on 13 workloads show that WOLT improves system performance and efficiency.
Moreover, I propose BigEnough, an approach that employs multiple matrix-matrix multiplication accelerators, each tailored to a different input size. A smart scheduler dynamically selects the best-fit accelerator at runtime, balancing computational load and maximizing energy efficiency in multi-tenant environments.
In summary, my thesis is that abstracting and separating the design of hardware and software components enables the seamless deployment of software applications, including machine learning models, onto heterogeneous many-accelerator architectures, while a resource manager optimizes efficiency through parallel execution and streamlined accelerator invocation. I present a comprehensive and practical approach to deploying ML and signal processing workloads on heterogeneous SoCs. By combining modular hardware design, high-level software integration, and runtime efficiency, my dissertation advances embedded system design and paves the way for scalable, programmable edge computing platforms.
More About This Work
- Academic Units
- Computer Science
- Thesis Advisors
- Carloni, Luca
- Degree
- Ph.D., Columbia University
- Published Here
- May 21, 2025