Doctoral Thesis

Transparent Deployment of Machine Learning Models on Many-Accelerator Architectures

Chiu, Kuan-Lin

The growing demand for machine learning (ML) and signal processing in embedded and edge systems introduces challenges in balancing performance, energy efficiency, and software development simplicity. Heterogeneous Systems-on-Chip (SoCs), which combine general-purpose processors with specialized accelerators, offer a promising solution. However, deploying applications efficiently on such architectures requires new hardware/software co-design methodologies, integration strategies, and runtime resource management. This dissertation presents an approach that separates application logic from accelerator integration, enabling more efficient deployment of ML and signal processing workloads on many-accelerator SoCs.

To begin, I introduce ESP4ML, an open-source design flow that combines ESP, a modular SoC platform, with hls4ml, a high-level synthesis tool for generating ML accelerators. ESP4ML provides an embedded runtime software application programming interface (API) for managing accelerators dynamically within Linux. It enables direct data transfer between accelerators, reducing off-chip memory accesses. These features allow rapid prototyping of SoCs and are demonstrated through FPGA-based implementations executing end-to-end embedded workloads.
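The abstract does not reproduce the API itself (the real ESP runtime is a C library under Linux). As a schematic illustration only, the sketch below models in Python what a runtime that manages accelerators and chains them back-to-back might look like; every name here (`Accelerator`, `invoke`, `chain`, the device names) is invented for illustration:

```python
# Hypothetical sketch of an accelerator-management runtime in the spirit
# of ESP4ML's embedded software API. All names are invented; the actual
# ESP runtime is a C library that talks to Linux device drivers.

class Accelerator:
    def __init__(self, name, fn):
        self.name = name  # stand-in for a device node, e.g. /dev/<name>
        self.fn = fn      # stand-in for the hardware computation

    def invoke(self, buf):
        # A real runtime would program registers and DMA the buffer;
        # here we just apply a placeholder function to show the flow.
        return self.fn(buf)

def chain(accelerators, buf):
    """Pass a buffer through accelerators back-to-back, modeling
    point-to-point transfers that skip round trips to main memory."""
    for acc in accelerators:
        buf = acc.invoke(buf)
    return buf

conv = Accelerator("conv2d.0", lambda b: [x * 2 for x in b])
fc = Accelerator("dense.0", lambda b: [x + 1 for x in b])
out = chain([conv, fc], [1, 2, 3])  # -> [3, 5, 7]
```

The point of the `chain` helper is the data path: each accelerator consumes its predecessor's output directly, which is the behavior the dissertation exploits to cut memory accesses.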

Following that, I analyze communication models by comparing memory-based and point-to-point (p2p) data transfer mechanisms for accelerators. Using synthetic and real-world benchmarks, such as Nightvision and Wide-Area Motion Imagery (WAMI), I show that p2p communication consistently delivers better performance and energy efficiency, particularly in multi-threaded and tile-based accelerator systems.

Next, a co-design strategy is presented that integrates the Eigen C++ linear algebra library with ESP. This enables high-level software to transparently offload computations to accelerators, achieving significant gains in both performance and energy efficiency without compromising software simplicity.
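To make the idea of transparent offload concrete, the following Python sketch models a drop-in matrix multiply that dispatches large products to an accelerator path and small ones to the CPU path, with the caller never naming a device. In the dissertation this interception happens inside the Eigen C++ library; the threshold value and all function names below are assumptions for illustration:

```python
# Schematic model of transparent offload: same call either way, with the
# dispatch decision hidden from the application. The threshold and names
# are invented; the real mechanism lives inside Eigen's C++ internals.

OFFLOAD_THRESHOLD = 64  # hypothetical: offload when the inner dim is large

def matmul_cpu(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_accel(a, b):
    # Stand-in for a DMA transfer plus hardware invocation; numerically
    # identical, so callers cannot tell the two paths apart.
    return matmul_cpu(a, b)

def matmul(a, b):
    """Single entry point: the application never names the device."""
    if len(b) >= OFFLOAD_THRESHOLD:
        return matmul_accel(a, b)
    return matmul_cpu(a, b)

c = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # -> [[19, 22], [43, 50]]
```

Keeping the two paths numerically interchangeable is what preserves software simplicity: existing Eigen-based code gains acceleration without any source changes.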

To further improve the transparency of deploying ML workloads, I introduce WOLT, a lightweight software layer that enables hardware acceleration for TensorFlow Lite (TFLite) workloads. By leveraging TFLite’s delegate interface, WOLT routes supported operations to accelerators without modifying application code. In addition, a built-in resource manager supports conflict-free, multi-tenant execution of ML applications. FPGA-based experiments on 13 workloads show that WOLT improves system performance and efficiency.
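A TFLite delegate works by claiming the operators it can execute and leaving the rest to the CPU kernels. The minimal Python model below illustrates that routing step; the operator names and the `SUPPORTED` set are illustrative, not WOLT's actual coverage, and a real delegate claims contiguous subgraphs rather than individual ops:

```python
# Minimal model of delegate-style operator routing, in the spirit of how
# WOLT uses TFLite's delegate interface. The SUPPORTED set is invented;
# a real delegate partitions the graph into contiguous subgraphs.

SUPPORTED = {"CONV_2D", "FULLY_CONNECTED"}

def partition(ops):
    """Split a model's op list into accelerator and CPU partitions,
    mirroring how a delegate claims the nodes it can execute."""
    accel = [op for op in ops if op in SUPPORTED]
    cpu = [op for op in ops if op not in SUPPORTED]
    return accel, cpu

model_ops = ["CONV_2D", "RELU", "FULLY_CONNECTED", "SOFTMAX"]
accel_ops, cpu_ops = partition(model_ops)
```

Because the partitioning happens behind the delegate interface, the application still just calls the interpreter; this is what lets WOLT accelerate workloads without modifying application code.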

Moreover, I propose BigEnough, an approach that employs multiple matrix-matrix multiplication accelerators, each tailored to a different input size. A smart scheduler dynamically selects the best-fit accelerator at runtime, balancing computational load and maximizing energy efficiency in multi-tenant environments.
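A toy version of such a best-fit policy is sketched below: several accelerators, each tuned for a maximum problem size, and a scheduler that picks the smallest idle unit that fits the request. The sizes, names, and tie-breaking rule are assumptions, not BigEnough's actual policy:

```python
# Toy best-fit scheduler in the spirit of BigEnough: pick the smallest
# idle matmul accelerator that can handle the request, reserving it so
# that concurrent tenants never collide. All parameters are invented.

ACCELERATORS = [
    {"name": "mm_small",  "max_dim": 64,   "busy": False},
    {"name": "mm_medium", "max_dim": 256,  "busy": False},
    {"name": "mm_large",  "max_dim": 1024, "busy": False},
]

def schedule(dim, accelerators=ACCELERATORS):
    """Return the smallest idle accelerator that can run a dim x dim
    product, or None to signal CPU fallback in a full system."""
    candidates = [a for a in accelerators
                  if not a["busy"] and a["max_dim"] >= dim]
    if not candidates:
        return None
    best = min(candidates, key=lambda a: a["max_dim"])
    best["busy"] = True  # reserve it for conflict-free sharing
    return best["name"]

first = schedule(200)   # -> "mm_medium" (smallest unit that fits)
second = schedule(200)  # mm_medium is now busy -> "mm_large"
```

Preferring the smallest adequate unit keeps the large accelerator free for the requests that genuinely need it, which is the load-balancing intuition the abstract describes.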

In summary, my thesis is that abstracting and separating the design of hardware and software components enables the seamless deployment of software applications, including machine learning models, onto heterogeneous many-accelerator architectures, while a resource manager optimizes efficiency through parallel execution and streamlined accelerator invocation. I present a comprehensive and practical approach to deploying ML and signal processing workloads on heterogeneous SoCs. By combining modular hardware design, high-level software integration, and runtime resource management, my dissertation advances embedded system design and paves the way for scalable, programmable edge computing platforms.

Files

  • Chiu_columbia_0054D_19154.pdf (application/pdf, 1.71 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Carloni, Luca
Degree
Ph.D., Columbia University
Published Here
May 21, 2025