Conceptualizing Treatment Leakage in Text-based Causal Inference

Adel Daoud | Connor Jerzak | Richard Johansson

Motivation

Text data is an increasingly popular source of quantitative information in social science fields such as economics, sociology, and political science. Naturally, researchers have sought out text data that might help distinguish correlation from causation in observational studies.

For example, documents written by the International Monetary Fund (IMF) about individual countries' economies can help model which countries were more likely to receive an IMF program (a “treatment”); this model can then be used to adjust the data (via re-weighting) to produce more credible causal estimates.
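The re-weighting idea can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the paper's pipeline: a single numeric score `x` stands in for features extracted from text (real documents would first need an embedding or similar representation), the data-generating process is invented for the example, and the helper names `fit_logistic`, `propensity`, and `ipw_ate` are hypothetical.

```python
import numpy as np

def fit_logistic(X, t, steps=500, lr=0.5):
    """Fit a logistic regression by plain gradient descent (intercept included)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - t) / len(t)
    return w

def propensity(X, w):
    """Estimated probability of treatment for each unit."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def ipw_ate(y, t, e):
    """Normalized inverse-propensity-weighted estimate of the average treatment effect."""
    w1 = t / e              # weights for treated units
    w0 = (1 - t) / (1 - e)  # weights for control units
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Simulated data (assumption): x confounds both treatment and outcome.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)                           # stand-in for a text-derived score
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))    # treatment depends on x
true_effect = 1.0
y = true_effect * t + 2.0 * x + rng.normal(size=n)

# Unadjusted comparison versus the re-weighted estimate.
naive = y[t == 1].mean() - y[t == 0].mean()
e_hat = propensity(x[:, None], fit_logistic(x[:, None], t))
adjusted = ipw_ate(y, t, e_hat)

print(f"naive difference in means: {naive:.2f}")
print(f"re-weighted estimate:      {adjusted:.2f}")
```

In this simulation the unadjusted difference in means absorbs the confounding through `x`, while the re-weighted estimate should land much closer to `true_effect`. The normalized form of the weights is one common choice among several.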

Paper Contributions

The problem this paper tackles is one we term “treatment leakage.” Text data is inherently high-dimensional and unstructured. Some parts of a document may contain information about the treatment; other parts may contain information about the outcome. In the language of directed acyclic graphs (DAGs), text as a data source potentially contains multiple causal node types. In more intuitive terms, text is messy and might not be just a proxy for confounding variables or a predictor of the outcome; it can serve multiple functions at once.
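One way a document that reveals the treatment could undermine re-weighting is sketched below on simulated data. Everything here is an assumption made for illustration rather than a result from the paper: `x` stands in for a text-derived confounder score, `leak` is a hypothetical feature that nearly records the treatment itself (as if the document mentioned the program), and `fit_propensity` and `ipw_ate` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_propensity(X, t, steps=500, lr=0.5):
    """Logistic regression by gradient descent; returns fitted treatment probabilities."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - t) / len(t)
    return sigmoid(Xb @ w)

def ipw_ate(y, t, e):
    """Normalized inverse-propensity-weighted estimate of the average treatment effect."""
    w1, w0 = t / e, (1 - t) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Simulated data (assumption): x is a genuine confounder.
rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
t = rng.binomial(1, sigmoid(x))
true_effect = 1.0
y = true_effect * t + 2.0 * x + rng.normal(size=n)

# Hypothetical leaked feature: a noisy copy of the treatment indicator.
leak = t + 0.1 * rng.normal(size=n)

# Propensity model using only the confounder.
est_clean = ipw_ate(y, t, fit_propensity(x[:, None], t))

# Propensity model that also sees the leaked feature.
est_leaky = ipw_ate(y, t, fit_propensity(np.column_stack([x, leak]), t))

print(f"estimate without leaked feature: {est_clean:.2f}")
print(f"estimate with leaked feature:    {est_leaky:.2f}")
```

In this setup the leaked feature lets the model predict treatment almost perfectly, so the fitted probabilities sit near 0 or 1, the weights become nearly constant, and the estimate drifts back toward the unadjusted comparison. The sketch shows only one possible failure mode under these assumptions.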

Our paper formalizes the problem, proposes a method for addressing it in practice, examines the assumptions required for the method to remain valid, validates it via simulation using GPT models, and evaluates its performance on real data from the IMF example. We are excited about extending the work to other settings, such as satellite data, where a single data source may contain information about several distinct parts of a causal system.

References

Adel Daoud, Connor T. Jerzak, Richard Johansson. Conceptualizing Treatment Leakage in Text-based Causal Inference. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): 5638-5645, 2022.
@article{daoud2022conceptualizing,
  title={Conceptualizing Treatment Leakage in Text-based Causal Inference},
  author={Daoud, Adel and Connor T. Jerzak and Richard Johansson},
  journal={Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2022},
  pages={5638-5645}
}

Related Work

Nicolas Audinet de Pieuchon, Adel Daoud, Connor T. Jerzak, Moa Johansson, Richard Johansson. Can Large Language Models (or Humans) Distill Text? Sixth Workshop on NLP and Computational Social Science at NAACL, 2024.
@article{pieuchon2024can,
  title={Can Large Language Models (or Humans) Distill Text?},
  author={Pieuchon, Nicolas Audinet de and Adel Daoud and Connor T. Jerzak and Moa Johansson and Richard Johansson},
  journal={Sixth Workshop on NLP and Computational Social Science at NAACL},
  year={2024}
}

Connor T. Jerzak, Gary King, Anton Strezhnev. An Improved Method of Automated Nonparametric Content Analysis for Social Science. Political Analysis, 31(1): 42-58, 2023.
@article{jerzak2023improved,
  title={An Improved Method of Automated Nonparametric Content Analysis for Social Science},
  author={Jerzak, Connor T. and Gary King and Anton Strezhnev},
  journal={Political Analysis},
  year={2023},
  volume={31},
  number={1},
  pages={42-58}
}