Can Large Language Models (or Humans) Distill Text?

Nicolas Audinet de Pieuchon | Adel Daoud | Connor Jerzak | Moa Johansson | Richard Johansson

Abstract: We investigate the potential of large language models (LLMs) to distill text: to remove the textual traces of an undesired (forbidden) variable. We employ a range of LLMs with varying architectures and training approaches to distill text by identifying and removing information about a target variable while preserving other relevant signals. Our findings shed light on the strengths and limitations of LLMs in addressing this distillation task and provide insights into strategies for leveraging these models in computational social science investigations involving text data. In particular, in the strong test of removing sentiment, we show that the statistical association between the processed text and sentiment remains clearly detectable to machine learning classifiers after LLM distillation. Furthermore, we find that human annotators also struggle to distill sentiment while preserving other semantic content. This suggests that there may be limited separability between concept variables in some text contexts. It highlights the limitations of methods that rely on text-level transformations, and it raises questions about the robustness of distillation methods that achieve statistical independence in representation space when this is difficult for human coders operating on raw text to attain.

References

Nicolas Audinet de Pieuchon, Adel Daoud, Connor T. Jerzak, Moa Johansson, Richard Johansson. Can Large Language Models (or Humans) Distill Text? Sixth Workshop on NLP and Computational Social Science at NAACL, 2024.
@article{pieuchon2024can,
  title={Can Large Language Models (or Humans) Distill Text?},
  author={Audinet de Pieuchon, Nicolas and Daoud, Adel and Jerzak, Connor T. and Johansson, Moa and Johansson, Richard},
  journal={Sixth Workshop on NLP and Computational Social Science at NAACL},
  year={2024}
}

Related Work

Adel Daoud, Connor T. Jerzak, Richard Johansson. Conceptualizing Treatment Leakage in Text-based Causal Inference. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): 5638-5645, 2022.
@article{daoud2022conceptualizing,
  title={Conceptualizing Treatment Leakage in Text-based Causal Inference},
  author={Daoud, Adel and Jerzak, Connor T. and Johansson, Richard},
  journal={Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2022},
  pages={5638--5645}
}

Connor T. Jerzak, Gary King, Anton Strezhnev. An Improved Method of Automated Nonparametric Content Analysis for Social Science. Political Analysis, 31(1): 42-58, 2023.
@article{jerzak2023improved,
  title={An Improved Method of Automated Nonparametric Content Analysis for Social Science},
  author={Jerzak, Connor T. and King, Gary and Strezhnev, Anton},
  journal={Political Analysis},
  year={2023},
  volume={31},
  number={1},
  pages={42--58}
}
