Theses Doctoral

A foundation model of transcription regulation and applications to cancer

Fu, Xi

Transcriptional regulation governs cellular identity through interactions between DNA sequences, chromatin states, and regulatory proteins. Understanding how the static genome encodes diverse cell types—and how dysregulation drives disease—requires predicting cell-type-specific gene expression and the molecular mechanisms underlying regulatory control. Existing computational approaches cannot generalize to unseen cell types without retraining, lack mechanistic interpretability, and fail to integrate the multimodal information—genomic sequence, chromatin accessibility, protein structure, and evolutionary constraints—that jointly determines transcription factor binding and gene regulation.

This thesis develops computational frameworks building toward a foundation model of transcription regulation with increasing resolution and mechanistic grounding. I first created GET (General Expression Transformer), which achieves experimental-level accuracy in predicting gene expression in unseen human cell types by learning regulatory grammar from chromatin accessibility across 213 cell types through self-supervised pretraining. Applied to fetal erythroblasts, GET identifies distal regulatory elements controlling fetal hemoglobin and discovers coregulation patterns suggesting transcription factor interactions.

To investigate the molecular mechanisms underlying these interactions, I combined GET's coregulation predictions with AlphaFold structure prediction, constructing a catalog of transcription factor interaction structures. Applied to familial B-cell leukemia, this approach reveals how the PAX5 G183S germline variant disrupts an intrinsically disordered region mediating nuclear receptor interactions, a finding validated through proximity labeling and patient transcriptomes. However, this structural analysis required knowing which disease-associated variant to investigate. To systematically identify functionally important regions across proteins without prior knowledge, I developed the ES score, which combines AlphaFold structural features with evolutionary constraints from protein language models. This unsupervised approach shows that disorder-order boundaries often underpin cancer hotspot mutations and demonstrates that protein language models capture evolutionary constraints that generalize across protein classes.

Three insights emerged from this work: that transformers learn regulatory grammar from chromatin patterns, that cell-type specificity emerges from accessibility, and that ESM embeddings encode functional constraints. Together with the need to know exactly where a transcription factor binds in order to understand transcription regulation mechanistically, these insights motivated a multimodal approach. I co-developed Chromnitron, which combines DNA sequence, chromatin accessibility, and ESM protein embeddings to predict cell-type-specific binding for 767 chromatin-associated proteins at nucleotide resolution.

Trained on 1,105 ChIP-seq experiments, Chromnitron generalizes to unseen cell types and proteins. Systematic in silico perturbations across DNA, chromatin, and protein sequences reveal both canonical binding principles and nonlinear relationships between accessibility and binding. Applied to T cell exhaustion, a major barrier to cancer immunotherapy in which tumor-infiltrating T cells lose cytotoxic function, Chromnitron identifies ZNF865 as a novel regulator of TOX expression that directly modulates the exhaustion phenotype. This finding was validated through pooled CRISPR screens in primary human T cells, which showed reduced exhaustion markers and restored effector function upon knockout.

Files

  • gsas-dissertations-000084.pdf (application/pdf, 8.02 MB)

More About This Work

Academic Units
Biomedical Informatics
Thesis Advisors
Rabadan, Raul
Degree
Ph.D., Columbia University
Published Here
January 21, 2026