Conceptualizing Treatment Leakage in Text-based Causal Inference

Project Summary

Text data is an increasingly popular source of quantitative information in social science fields such as economics, sociology, and political science. Naturally, researchers have sought out text data that might help distinguish correlation from causation in observational studies. For example, documents written by the International Monetary Fund (IMF) about different countries' economies can help model which countries were more likely to receive an IMF program (a “treatment”); this model can then be used to adjust the data (via re-weighting) in order to produce more credible causal estimates.
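The re-weighting idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: the eight documents, the outcomes, the single keyword feature ("crisis"), and the stratum-frequency propensity model are all hypothetical and are not the data or the model used in the paper. It estimates how likely each unit was to be treated given its text feature, then compares treated and control outcomes after weighting each unit by the inverse of that probability.

```python
from collections import defaultdict

# Hypothetical toy data: (document text, treated flag, outcome).
# Within each text stratum the true treatment effect is +1.0 by construction.
DATA = [
    ("debt crisis worsens", 1, 2.0),
    ("currency crisis spreads", 1, 2.0),
    ("banking crisis continues", 1, 2.0),
    ("crisis risk remains", 0, 1.0),
    ("steady growth reported", 1, 5.0),
    ("exports rise", 0, 4.0),
    ("inflation stable", 0, 4.0),
    ("reserves adequate", 0, 4.0),
]


def text_feature(text):
    """Assumed confounder proxy: does the document mention 'crisis'?"""
    return int("crisis" in text.split())


def estimate_propensities(data):
    """Share of treated units within each value of the text feature."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [n_treated, n_total]
    for text, treated, _ in data:
        cell = counts[text_feature(text)]
        cell[0] += treated
        cell[1] += 1
    return {feat: n_t / n for feat, (n_t, n) in counts.items()}


def naive_difference(data):
    """Unadjusted difference in mean outcomes, treated minus control."""
    treated = [y for _, t, y in data if t == 1]
    control = [y for _, t, y in data if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_estimate(data):
    """Normalized inverse-probability-weighted difference in means."""
    propensity = estimate_propensities(data)
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for text, t, y in data:
        p = propensity[text_feature(text)]
        weight = 1.0 / p if t == 1 else 1.0 / (1.0 - p)
        num[t] += weight * y
        den[t] += weight
    return num[1] / den[1] - num[0] / den[0]


if __name__ == "__main__":
    print("naive:", naive_difference(DATA))
    print("re-weighted:", ipw_estimate(DATA))
```

In this toy setup the unadjusted comparison has the wrong sign because treated units are concentrated in the low-outcome "crisis" stratum, while the re-weighted comparison recovers the effect built into the data.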

The problem this paper tackles is one we term “treatment leakage.” Text data is inherently high-dimensional and unstructured. Some parts of a document may contain information about the treatment; other parts may contain information about the outcome. In the language of directed acyclic graphs (DAGs), text as a data source potentially contains multiple causal node types. In more intuitive terms, text is messy and might not be just a proxy for confounding variables or a predictor of the outcome: it can serve multiple functions at once. Our paper formalizes the problem, proposes a method for addressing it in practice, examines the assumptions required for the method to remain valid, and validates it via simulation and in real data from the IMF example. We are excited about extending the work to other settings, such as satellite data, where a single data source may contain information about several different aspects of a causal system.
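A small sketch may make one consequence of leakage concrete. Everything in it is an assumption made for illustration: the six documents are invented, the leaked tokens are taken as known in advance, and dropping them is a simple stand-in, not the method proposed in the paper. When a document's wording reveals the treatment itself, a treatment model built on the full text predicts treatment perfectly, so the estimated probabilities are all 0 or 1 and inverse-probability weights are undefined for some units.

```python
from collections import defaultdict

# Hypothetical toy corpus: (document text, treated flag).
# The tokens "program approved" appear only in treated units' documents,
# so they leak the treatment rather than describe a confounder.
DOCS = [
    ("crisis deepens program approved", 1),
    ("crisis deepens program approved", 1),
    ("crisis deepens", 0),
    ("stable growth program approved", 1),
    ("stable growth", 0),
    ("stable growth", 0),
]

# Assumption: the leaked tokens are known; in practice they must be found.
LEAKED_TOKENS = frozenset({"program", "approved"})


def propensity_by_text(docs, drop=frozenset()):
    """Share of treated units among documents with the same retained tokens."""
    groups = defaultdict(list)
    for text, treated in docs:
        key = tuple(tok for tok in text.split() if tok not in drop)
        groups[key].append(treated)
    return {key: sum(flags) / len(flags) for key, flags in groups.items()}


if __name__ == "__main__":
    # Full text: every estimated probability is exactly 0.0 or 1.0.
    print(propensity_by_text(DOCS))
    # Leaked tokens removed: probabilities fall strictly between 0 and 1.
    print(propensity_by_text(DOCS, drop=LEAKED_TOKENS))
```

The sketch shows only this one estimation difficulty; it does not cover the identification issues that the paper formalizes.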


Adel Daoud, Connor T. Jerzak, Richard Johansson. Conceptualizing Treatment Leakage in Text-based Causal Inference. NAACL, 2022.
