2020 Theses Doctoral
Reconfigurable Optically Interconnected Systems
With the immense growth of data consumption in today's data centers and high-performance computing systems driven by the constant influx of new applications, the network infrastructure supporting this demand is under increasing pressure to enable higher bandwidth, latency, and flexibility requirements. Optical interconnects, able to support high bandwidth wavelength division multiplexed signals with extreme energy efficiency, have become the basis for long-haul and metro-scale networks around the world, while photonic components are being rapidly integrated within rack and chip-scale systems. However, optical and photonic interconnects are not a direct replacement for electronic-based components. Rather, the integration of optical interconnects with electronic peripherals allows for unique functionalities that can improve the capacity, compute performance and flexibility of current state-of-the-art computing systems. This requires physical layer methodologies for their integration with electronic components, as well as system level control planes that incorporates the optical layer characteristics. This thesis explores various network architectures and the associated control plane, hardware infrastructure, and other supporting software modules needed to integrate silicon photonics and MEMS based optical switching into conventional datacom network systems ranging from intra-data center and high-performance computing systems to the metro-scale layer networks between data centers. In each of these systems, we demonstrate dynamic bandwidth steering and compute resource allocation capabilities to enable significant performance improvements. The key accomplishments of this thesis are as follows.
In Part 1, we present high-performance computing network architectures that integrate silicon photonic switches for optical bandwidth steering, enabling multiple reconfigurable topologies that results in significant system performance improvements. As high-performance systems rely on increased parallelism by scaling up to greater numbers of processor nodes, communication between these nodes grows rapidly and the interconnection network becomes a bottleneck to the overall performance of the system. It has been observed that many scientific applications operating on high-performance computing systems cause highly skewed traffic over the network, congesting only a small percentage of the total available links while other links are underutilized. This mismatch of the traffic and the bandwidth allocation of the physical layer network presents the opportunity to optimize the bandwidth resource utilization of the system by using silicon photonic switches to perform bandwidth steering. This allows the individual processors to perform at their maximum compute potential and thereby improving the overall system performance. We show various testbeds that integrates both microring resonator and Mach-Zehnder based silicon photonic switches within Dragonfly and Fat-Tree topology networks built with conventional
equipment, and demonstrate 30-60% reduction in execution time of real high-performance benchmark applications.
Part 2 presents a flexible network architecture and control plane that enables autonomous bandwidth steering and IT resource provisioning capabilities between metro-scale geographically distributed data centers. It uses a software-defined control plane to autonomously provision both network and IT resources to support different quality of service requirements and optimizes resource utilization under dynamically changing load variations. By actively monitoring both the bandwidth utilization of the network and CPU or memory resources of the end hosts, the control plane autonomously provisions background or dynamic connections with different levels of quality of service using optical MEMS switching, as well as initializing live migrations of virtual machines to consolidate or distribute workload. Together these functionalities provide flexibility and maximize efficiency in processing and transferring data, and enables energy and cost savings by scaling down the system when resources are not needed. An experimental testbed of three data center nodes was built to demonstrate the feasibility of these capabilities.
Part 3 presents Lightbridge, a communications platform specifically designed to provide a more seamless integration between processor nodes and an optically switched network. It addresses some of the crucial issues faced by the works presented in the previous chapters related to optical switching. When optical switches perform switching operations, they change the physical topology of the network, and they lack the capability to buffer packets, resulting in certain optical circuits being unavailable. This prompts the question of whether it is safe to transmit packets by end hosts at any given time. Lightbridge was developed to coordinate switching and routing of optical circuits across the network, by having the processors gain information about the current state of the optical network before transmitting packets, and being able to buffer packets when the optical circuit is not available. This part describes details of Lightbridge which is constituted by a loadable Linux kernel module along with other supporting modifications to the Linux kernel in order to achieve the necessary functionalities.
- Shen_columbia_0054D_15770.pdf application/pdf 5.01 MB Download File
More About This Work
- Academic Units
- Electrical Engineering
- Thesis Advisors
- Bergman, Keren
- Ph.D., Columbia University
- Published Here
- February 13, 2020