Conceptualizing Treatment Leakage in Text-based Causal Inference

Text data is an increasingly popular source of quantitative information in social science fields such as economics, sociology, and political science. Naturally, researchers have sought out text data that might help distinguish correlation from causation in observational studies.

For example, written documents produced by the International Monetary Fund (IMF) about different country economies can help model which countries were more likely to receive an IMF program (a “treatment”); this model can then be used to adjust the data (via re-weighting) in order to produce more credible causal estimates.
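The re-weighting idea can be sketched with a small simulation. This is a minimal illustration, not the paper's implementation: the single confounder `x` stands in for whatever a text model would extract from the documents, the true treatment probability is used in place of a fitted model, and all function names and numeric settings are hypothetical.

```python
import math
import random


def naive_difference(treated, outcome):
    """Unadjusted difference in mean outcomes between treated and control units."""
    y1 = [y for t, y in zip(treated, outcome) if t == 1]
    y0 = [y for t, y in zip(treated, outcome) if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def ipw_ate(treated, outcome, propensity):
    """Normalized (Hajek) inverse-probability-weighted estimate of the ATE."""
    w1 = [t / p for t, p in zip(treated, propensity)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, propensity)]
    mean1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean1 - mean0


def simulate(n=20000, seed=0, effect=2.0):
    """Toy data (assumed): one confounder drives both treatment and outcome."""
    rng = random.Random(seed)
    treated, outcome, propensity = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)          # hypothetical text-derived confounder
        p = 1.0 / (1.0 + math.exp(-x))   # treatment probability given x
        t = 1 if rng.random() < p else 0
        y = effect * t + 1.5 * x + rng.gauss(0.0, 1.0)
        treated.append(t)
        outcome.append(y)
        propensity.append(p)
    return treated, outcome, propensity


if __name__ == "__main__":
    t, y, p = simulate()
    print("naive difference:", round(naive_difference(t, y), 2))
    print("re-weighted (IPW):", round(ipw_ate(t, y, p), 2))
```

In this toy setup the confounder raises both the chance of treatment and the outcome, so the unadjusted comparison overstates the simulated effect of 2.0, while the re-weighted estimate should land close to it.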

The problem this paper tackles is one we term “treatment leakage.” Text data is inherently high-dimensional and unstructured. Some parts of a document may contain information about the treatment; other parts may contain information about the outcome. In the language of directed acyclic graphs (DAGs), text as a data source potentially contains multiple causal node types. In more intuitive terms, text is messy and might not be just a proxy for confounding variables or a predictor of the outcome; it can serve multiple functions at once.
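One way to see why leakage matters for re-weighting is an assumed extreme case: the text states the treatment outright, so a treatment model fit on the text predicts almost exactly what happened. The sketch below is an illustration of that single failure mode, not the paper's formal analysis; the data and the `leaked_propensity` helper are hypothetical.

```python
def naive_difference(treated, outcome):
    """Unadjusted difference in mean outcomes between treated and control units."""
    y1 = [y for t, y in zip(treated, outcome) if t == 1]
    y0 = [y for t, y in zip(treated, outcome) if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def hajek_ipw(treated, outcome, propensity):
    """Normalized inverse-probability-weighted estimate of the ATE."""
    w1 = [t / p for t, p in zip(treated, propensity)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, propensity)]
    mean1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean1 - mean0


def leaked_propensity(treated, eps=0.01):
    """Assumed worst case: the model reads the treatment off the text,
    so its predicted probability is near 1 for treated units and near 0 otherwise."""
    return [1.0 - eps if t == 1 else eps for t in treated]


if __name__ == "__main__":
    # Hypothetical toy data: three treated and three control units.
    treated = [1, 1, 1, 0, 0, 0]
    outcome = [5.0, 4.0, 6.0, 1.0, 2.0, 3.0]

    print("naive difference:", naive_difference(treated, outcome))
    print("IPW with leaked propensities:",
          hajek_ipw(treated, outcome, leaked_propensity(treated)))
```

Because the leaked probabilities are constant within each treatment arm, the normalized weights are constant too, and the re-weighted estimate collapses back to the unadjusted difference. Whatever confounding information the text carried no longer contributes to the adjustment.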

Our paper formalizes the problem, proposes a method for addressing it in practice, examines the assumptions required for the method to remain valid, validates it via simulation using GPT models, and examines performance on real data from the IMF example. We are excited about extending the work to other settings, such as satellite data, where a single data source may contain information about several distinct parts of a causal system.


Adel Daoud, Connor T. Jerzak, Richard Johansson. Conceptualizing Treatment Leakage in Text-based Causal Inference. NAACL, 2022.

Related Work

Connor T. Jerzak, Gary King, Anton Strezhnev. An Improved Method of Automated Nonparametric Content Analysis for Social Science. Political Analysis, 31(1): 42-58, 2023.

