2024 Theses Doctoral
High-throughput screening enabled advances in protein engineering
Nature has produced a dazzling array of proteins that perform useful and interesting functions. Over the last 50 years, biologists have begun to re-engineer these tiny machines, either to perform new functions or to perform their existing functions more efficiently. However, protein engineering suffers from the massive scale of the search space. To improve the field’s ability to understand and engineer proteins, we present improvements to both generating and understanding large protein-function data sets. We apply these approaches to two tasks, generating data sets that measure the activity of tens of thousands of protein variants and producing two novel CRISPR activators and an improved machine learning model that designs proteins from deep mutational scanning (DMS) data.
In the first task, we engineer proteins at the level of domains, recombining trans-activation domains to generate improved CRISPR activators. By analyzing proteins at the level of domains, we reduce the protein engineering task to a smaller combinatorial problem. CRISPRa tools enable biologists to activate transcription at arbitrary locations using an easily retargetable CRISPR guide. In addition to producing two novel CRISPRa tools that outperform the current state of the art, we perform what we believe to be the first systematic evaluation of the toxicity of CRISPRa tools in cells. We also perform a detailed analysis of the ways in which trans-activation domains interact in a multi-domain tool, and the impact of these interactions on both gene activation strength and toxicity.
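To illustrate why working at the domain level shrinks the search space, the following minimal sketch counts the constructs in a small recombination library and compares that with a residue-level space. The domain names, the number of slots, and the 50-residue length are hypothetical placeholders chosen for illustration; the abstract does not specify the domains or library design actually used.

```python
from itertools import product

# Hypothetical placeholder names; the abstract does not list the domains used.
domains = ["AD1", "AD2", "AD3", "AD4", "AD5"]
slots = 3  # assumed number of domain positions per construct

# Every ordered arrangement of domains across the slots, repeats allowed.
constructs = list(product(domains, repeat=slots))

# For comparison: all sequences of an assumed 50-residue region
# over the 20 standard amino acids.
residue_space = 20 ** 50

print(f"domain-level constructs: {len(constructs)}")
print(f"residue-level sequences: {residue_space:.2e}")
```

Under these assumptions the domain-level library is small enough to enumerate and screen exhaustively, whereas the residue-level space can only be sampled.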
Our second protein engineering project approaches the problem at the level of individual amino acids. One target is a chaperone protein, DNAJB6, which our lab previously identified as a rescuer of toxicity for multiple neurodegenerative proteins [28]. We use error-prone PCR to generate a library of over 30,000 compound mutants, which we screen using a yeast-based assay for their ability to rescue toxicity associated with an aggregation-prone protein, FUS. We also engineer GFP, which has many existing deep mutational scanning (DMS) datasets.
To engineer these proteins, we develop a machine learning model, OptiProt. OptiProt is trained on DMS data to approximate the sequence-to-score landscape and then performs machine-learning-directed evolution (MLDE), designing proteins that meet user-defined criteria. We test OptiProt’s ability to learn and generalize from our DNAJB6 and GFP DMS datasets by adding increasingly difficult constraints and asking it to solve them. OptiProt was able to design proteins that improved on the best variants within the DMS data and to integrate up to 50 mutations without breaking the functionality of the wildtype protein. Finally, we task OptiProt with difficult challenges such as compensating for a loss-of-function mutant or replacing every instance of a certain amino acid.
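The general MLDE loop described above (fit a surrogate on DMS scores, then search for sequences that satisfy constraints) can be sketched as follows. This is not OptiProt, whose architecture the abstract does not describe: the surrogate here is a simple additive per-substitution model, the search is a greedy pick, and the sequences, scores, and function names are all hypothetical toy values introduced for illustration.

```python
from collections import defaultdict

def mutations(wildtype, variant):
    """Set of (position, amino_acid) substitutions relative to wildtype."""
    return {(i, b) for i, (a, b) in enumerate(zip(wildtype, variant)) if a != b}

def fit_additive_model(wildtype, dms):
    """Estimate one effect per substitution: the mean score change of the
    variants carrying it, split evenly across each variant's mutations."""
    wt_score = dms[wildtype]
    totals, counts = defaultdict(float), defaultdict(int)
    for seq, score in dms.items():
        muts = mutations(wildtype, seq)
        for m in muts:
            totals[m] += (score - wt_score) / len(muts)
            counts[m] += 1
    return wt_score, {m: totals[m] / counts[m] for m in totals}

def predict(wildtype, wt_score, effects, seq):
    """Predicted score; substitutions absent from the data contribute 0."""
    return wt_score + sum(effects.get(m, 0.0) for m in mutations(wildtype, seq))

def design(wildtype, effects, max_mutations, forbidden=frozenset()):
    """Greedy design: best beneficial substitution per position, keeping at
    most max_mutations and skipping any forbidden amino acids."""
    best = {}
    for (pos, aa), eff in effects.items():
        if aa in forbidden or eff <= 0:
            continue
        if pos not in best or eff > best[pos][1]:
            best[pos] = (aa, eff)
    ranked = sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)
    seq = list(wildtype)
    for pos, (aa, _) in ranked[:max_mutations]:
        seq[pos] = aa
    return "".join(seq)

# Toy DMS table (hypothetical sequences and scores, not thesis data).
wildtype = "MKTA"
dms = {"MKTA": 1.0, "MRTA": 1.4, "MKSA": 0.6, "MKTG": 1.2, "MRTG": 1.6}

wt_score, effects = fit_additive_model(wildtype, dms)
candidate = design(wildtype, effects, max_mutations=2)
print(candidate, round(predict(wildtype, wt_score, effects, candidate), 3))
```

The `max_mutations` and `forbidden` arguments stand in for user-defined criteria such as a mutation budget or the exclusion of a particular amino acid. An additive model ignores interactions between mutations, so a practical MLDE system would likely need a more expressive surrogate and a broader search.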
Files
- Kratz_columbia_0054D_18973.pdf (application/pdf, 5.03 MB)
More About This Work
- Academic Units: Cellular, Molecular and Biomedical Studies
- Thesis Advisors: Chavez, Alejandro
- Degree: Ph.D., Columbia University
- Published Here: January 15, 2025