Academic Commons

Theses Doctoral

A Computational Perspective of Causal Inference and the Data Fusion Problem

Correa, Juan David

The ability to process and reason with causal information is fundamental in many aspects of human cognition and is pervasive in the way we probe reality in many of the empirical sciences. Given the centrality of causality to many aspects of human experience, we expect that the next generation of AI systems will need to represent causal knowledge, combine heterogeneous and biased datasets, and generalize across changing conditions and disparate domains to attain human-like intelligence.

This dissertation investigates a problem in causal inference known as Data Fusion, which is concerned with inferring causal and statistical relationships from a combination of heterogeneous data collections from different domains, with various experimental conditions, and with nonrandom sampling (sampling selection bias). Despite the general conditions and algorithms developed so far for many aspects of the fusion problem, there are still significant aspects that are not well understood and have not been studied together, as they appear in many challenging real-world applications.

Specifically, this work advances our understanding of several dimensions of data fusion problems, which include the following capabilities and research questions:

Reasoning with Soft Interventions. How to identify the effect of conditional and stochastic policies in a complex data fusion setting? Specifically, under what conditions can the effect of a new stochastic policy be evaluated using data from disparate sources and collected under different experimental conditions?

Deciding Statistical Transportability. Under what conditions can statistical relationships (e.g., conditional distributions, classifiers) be extrapolated across disparate domains, where the target is somewhat related but not the same as the source domain where the data was initially collected? How to leverage additional data over a few variables in the target domain to help with the generalization process?

Recovering from Selection Bias. How to determine whether a sample that was preferentially selected can be recovered so as to make a claim about the general underlying super-population? How can additional data over a subset of the variables, but sampled randomly, be used to achieve this goal?
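
As a concrete illustration of the last question, the following minimal sketch shows one simple situation in which recovery is possible. It is not taken from the dissertation: the binary variables Z, X, Y, the selection indicator S, and all probabilities are hypothetical, and the sketch assumes that selection depends only on X and Z (so Y is independent of S given X and Z). Under that assumption, P(y | x) can be computed as the sum over z of P(y | x, z, S=1) P(z | x), where the first term comes from the biased sample and the second from randomly sampled data over X and Z only.

```python
from itertools import product

# Hypothetical toy model over binary Z, X, Y (all numbers are made up).
P_Z1 = 0.7                                                            # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.6}                                       # P(X=1 | z)
P_Y1_GIVEN_ZX = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}  # P(Y=1 | z, x)
# Assumption: selection depends on (z, x) only, never directly on y.
P_S1_GIVEN_ZX = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.7}  # P(S=1 | z, x)


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


# Joint P(z, x, y) in the underlying population.
joint = {
    (z, x, y): bern(P_Z1, z) * bern(P_X1_GIVEN_Z[z], x) * bern(P_Y1_GIVEN_ZX[(z, x)], y)
    for z, x, y in product((0, 1), repeat=3)
}

# Distribution observed in the preferentially selected sample: P(z, x, y | S=1).
norm = sum(p * P_S1_GIVEN_ZX[(z, x)] for (z, x, _), p in joint.items())
selected = {(z, x, y): p * P_S1_GIVEN_ZX[(z, x)] / norm for (z, x, y), p in joint.items()}


def p_y1_given_x(dist, x):
    """P(Y=1 | x) computed directly from a distribution over (z, x, y)."""
    num = sum(p for (_, xx, y), p in dist.items() if xx == x and y == 1)
    den = sum(p for (_, xx, _), p in dist.items() if xx == x)
    return num / den


def recovered_p_y1_given_x(x):
    """Combine biased P(Y=1 | x, z, S=1) with unbiased P(z | x)."""
    p_x = sum(joint[(z, x, y)] for z in (0, 1) for y in (0, 1))
    total = 0.0
    for z in (0, 1):
        biased = selected[(z, x, 1)] / (selected[(z, x, 0)] + selected[(z, x, 1)])
        p_z_given_x = (joint[(z, x, 0)] + joint[(z, x, 1)]) / p_x  # external unbiased data
        total += biased * p_z_given_x
    return total


for x in (0, 1):
    print(x, p_y1_given_x(joint, x), p_y1_given_x(selected, x), recovered_p_y1_given_x(x))
```

In this toy setup the recovered value agrees with the population value of P(Y=1 | x), while the naive estimate computed from the selected sample alone does not. If selection also depended on Y, this particular adjustment would no longer be justified.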

Instead of developing conditions and algorithms for each problem independently, this thesis introduces a computational framework capable of solving those research problems when they appear together. The approach decomposes the query and the available heterogeneous distributions into factors with a canonical form. The inference process is then reduced to mapping the required factors to those available from the data and evaluating the query as a function of the input based on that mapping.
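
The mapping step can be pictured with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the `Factor` key (a set of variables plus a set of intervened variables), the `factor` and `map_factors` helpers, and the source names are hypothetical, and the sketch only performs exact lookups. It does not attempt the harder part of the problem, which is deriving a required factor from larger or differently conditioned available ones.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

# Hypothetical canonical key: (variables in the factor, variables intervened on).
Factor = Tuple[FrozenSet[str], FrozenSet[str]]


def factor(variables: Iterable[str], interventions: Iterable[str] = ()) -> Factor:
    """Build a hashable, order-independent key for a factor."""
    return (frozenset(variables), frozenset(interventions))


def map_factors(
    required: List[Factor], available: Dict[str, List[Factor]]
) -> Optional[Dict[Factor, str]]:
    """Map each required factor to a data source that provides it.

    Returns None when some required factor is not provided by any source,
    which in this toy version means the query cannot be evaluated.
    """
    mapping: Dict[Factor, str] = {}
    for needed in required:
        source = next(
            (name for name, factors in available.items() if needed in factors),
            None,
        )
        if source is None:
            return None
        mapping[needed] = source
    return mapping


# Hypothetical query that needs two factors.
required = [factor({"Y"}, {"X"}), factor({"Z"})]

# Hypothetical data sources and the factors each one provides.
available = {
    "observational": [factor({"Z"}), factor({"X"})],
    "experiment_on_X": [factor({"Y"}, {"X"})],
}

mapping = map_factors(required, available)
print(mapping)
```

Here each required factor is found in exactly one source, so the lookup succeeds. Removing the `experiment_on_X` source would make `map_factors` return `None`.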

The problems and methods discussed have several applications in the empirical sciences, statistics, machine learning, and artificial intelligence.


More About This Work

Academic Units
Computer Science
Thesis Advisors
Bareinboim, Elias
Degree
Ph.D., Columbia University
Published Here
October 27, 2021