Theses Doctoral

Machine Learning Applications in Soft Matter Systems

Shastry, Tejus

As modern research problems become increasingly complex, more efficient methods are needed to navigate high-dimensional variable spaces or automate time-intensive steps. Machine learning has been presented as a highly capable toolkit to address these issues by means of high-throughput screening, design of experiments, and highly project-specific applications like computer vision. Recent advances across many fields have garnered attention, creating buzz around terms like "black box models", but opacity need not be the price of effectively leveraging the tools machine learning offers. To surmount justified suspicion around machine learning and deep learning and to build trust within the natural science community, model interpretability and impact must take center stage.

In service of this mission, we present work on several somewhat disjoint projects that share an overarching focus on transparency and on workflow direction or optimization. This thesis explores applications of machine learning such as quantitative structure-activity relationships and feature importance, automated object detection, and accelerated data processing for various soft matter applications.

In chapter 1, we explore the gas transport properties of polymer membranes as an alternative to energy-intensive separations like distillation. Under a cheminformatics approach, literature data covering the past 70 years are compiled and used to model ideal permeability and selectivity of six gases of interest (H₂, He, CO₂, O₂, N₂, CH₄). In attempts to solve the inverse design problem, i.e. predicting chemical structures from desired properties, we encounter significant obstacles in the form of non-robust mappings between chemical structure and traditional input representations for machine learning. We address this issue by resorting to simpler, more deterministic representations at only a slight cost to accuracy. This enables more reliable correlative analysis between features and model predictions, while also providing previously inaccessible insights by virtue of the choice of representation. The resulting structure-activity model both agrees with physics-based intuition and serves as the foundation for molecular design tasks, enabling high-throughput screening of polymer formulations without needing to manually perform thousands of experiments.

In chapter 2, we adapt two methods of object detection in service of image analysis for DNA origami superlattice studies, which would otherwise need to be done by hand. This involves recognizing highly translucent objects of generally cubic shape with enough accuracy to control online experimental procedures. Our two approaches leverage a convolutional neural network designed to excel in conservative yet representative estimates of superlattice count and size, and an image segmentation toolkit optimized for fast detections without specific training on the experimental images. We balance the utility and logistics of each approach in order to provide a pathway to a closed-loop autonomous experiment system for use in kinetics studies of the DNA superlattice formation.

In chapter 3, x-ray scattering data of these superlattices are sorted quickly and accurately on the basis of sample presence and, where a sample is present, superlattice orientation. These superlattices present immense utility for directed assembly of two- and three-dimensional architectures, but their integration into other systems relies on consistent placement and orientation. To study preferential orientations under various experimental settings, we employ unsupervised learning methods that sort the data reliably for use in subsequent analyses.

Through a wide range of applications, we emphasize the potential of machine learning as an interactive research tool, while simultaneously attempting to demystify some of the lesser-known aspects and pitfalls that future users may encounter. We choose the most challenging portions of these projects not to cast doubt on machine learning overall, but to demonstrate the importance of careful data and model choices when using machine learning on highly specific tasks such as soft matter research. Under proper use, with plenty of audits along the way, these tools can provide massive savings in both time and material usage.

Files

This item is currently under embargo. It will be available starting 2027-06-17.

More About This Work

Academic Units
Chemical Engineering
Thesis Advisors
Kumar, Sanat K.
Degree
Ph.D., Columbia University
Published Here
July 30, 2025