An Improved Method of Automated Nonparametric Content Analysis (“readme2”)

Connor Jerzak | Anton Strezhnev | Gary King


Motivation

Text is an increasingly common source of information for social scientists and others seeking to quantify the balance of opinion in a set of documents, such as tweets, legislative speeches, or blog posts. Given a set of documents that researchers have labeled, what is the overall distribution of opinion labels in an unlabeled test set?

This question is particularly relevant when researchers want to extrapolate from previously labeled text to new documents without manually reading each one and assigning it to a category (for example, positive, neutral, or negative sentiment toward a politician). This task of predicting the label distribution, known as quantification (as opposed to document-level classification), is the focus of this project.

Paper Contributions

Our method builds on prior work that estimates the distribution of document categories in the test set from the training set by applying the Law of Total Probability.
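To make the Law of Total Probability idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `estimate_proportions`, the simulated features, and the plain least-squares solve followed by clipping are all hypothetical choices made for this example. The mean of the features in the unlabeled set is treated as a mixture of the class-conditional feature means from the labeled set, and the mixture weights are the estimated category proportions.

```python
import numpy as np


def estimate_proportions(X_labeled, y_labeled, X_unlabeled, n_classes):
    """Estimate category proportions in the unlabeled set without
    classifying any individual document (illustrative sketch only)."""
    # Column c holds the mean feature vector among labeled documents in class c,
    # i.e. an estimate of E[S | D = c].
    cond_means = np.column_stack(
        [X_labeled[y_labeled == c].mean(axis=0) for c in range(n_classes)]
    )
    # Mean feature vector in the unlabeled set, i.e. an estimate of E[S].
    marginal = X_unlabeled.mean(axis=0)
    # Solve E[S] = E[S | D] @ pi for pi by ordinary least squares.
    pi, *_ = np.linalg.lstsq(cond_means, marginal, rcond=None)
    # Assumption: clip negatives and renormalize so pi is a valid distribution.
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()


# Simulated data (hypothetical): 3 categories, 20 numeric text features.
rng = np.random.default_rng(0)
class_means = rng.normal(size=(3, 20))

y_labeled = np.repeat([0, 1, 2], 1000)              # balanced labeled set
y_unlabeled = np.repeat([0, 1, 2], [1800, 900, 300])  # true mix: 0.6, 0.3, 0.1

X_labeled = class_means[y_labeled] + rng.normal(size=(3000, 20))
X_unlabeled = class_means[y_unlabeled] + rng.normal(size=(3000, 20))

estimate = estimate_proportions(X_labeled, y_labeled, X_unlabeled, n_classes=3)
print(np.round(estimate, 3))
```

The sketch relies on the class-conditional feature distributions being the same in the labeled and unlabeled sets, which is the assumption that this kind of estimator depends on.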

We derived the analytical bias of this kind of method in a more general setting, one that accommodates both discrete and continuous text features, using the Law of Total Expectation. We then used that bias expression to optimize the features that enter the calculation of the test set distribution, specifically to minimize estimation bias.

The paper also introduces a text matching step, borrowed from causal inference, that relaxes some of the assumptions of prior methods, allowing text to differ substantially between the labeled and unlabeled sets while still supporting robust estimation. We also outline other ways language can change over time that pose problems even for this improved method.
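One way to picture a matching step of this kind is the rough Python sketch below. It is a generic nearest-neighbor pruning rule written for illustration, not the procedure from the paper: the function name `match_labeled_to_unlabeled`, the Euclidean distance, and the choice of `k` are all assumptions. The idea is to keep only those labeled documents that resemble at least one unlabeled document, so that later estimation leans on labeled text that looks like the text being analyzed.

```python
import numpy as np


def match_labeled_to_unlabeled(X_labeled, X_unlabeled, k=3):
    """Return sorted indices of labeled rows that are among the k nearest
    labeled neighbors of at least one unlabeled row (illustrative sketch)."""
    # Pairwise squared Euclidean distances, shape (n_unlabeled, n_labeled).
    diffs = X_unlabeled[:, None, :] - X_labeled[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    # For each unlabeled row, the indices of its k closest labeled rows.
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    # Union of all matched labeled indices (np.unique flattens and sorts).
    return np.unique(nearest)


# Toy one-dimensional features (hypothetical): labeled rows 3 and 4 sit far
# from every unlabeled row, so they are pruned.
X_labeled = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
X_unlabeled = np.array([[0.05], [0.15]])

kept = match_labeled_to_unlabeled(X_labeled, X_unlabeled, k=2)
print(kept)
```

Class-conditional feature means would then be computed on `X_labeled[kept]` rather than on the full labeled set.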

Overall, the contribution of the project is to improve methods for quantification, helping researchers better understand the balance of opinion in text data that may be changing over time by representing that text in an explicitly optimized numerical form.

References

Connor T. Jerzak, Gary King, Anton Strezhnev. An Improved Method of Automated Nonparametric Content Analysis for Social Science. Political Analysis, 31(1): 42-58, 2023.
@article{jerzak2023improved,
  title={An Improved Method of Automated Nonparametric Content Analysis for Social Science},
  author={Jerzak, Connor T. and King, Gary and Strezhnev, Anton},
  journal={Political Analysis},
  year={2023},
  volume={31},
  number={1},
  pages={42--58}
}

Related Work

Adel Daoud, Connor T. Jerzak, Richard Johansson. Conceptualizing Treatment Leakage in Text-based Causal Inference. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): 5638-5645, 2022.
@inproceedings{daoud2022conceptualizing,
  title={Conceptualizing Treatment Leakage in Text-based Causal Inference},
  author={Daoud, Adel and Jerzak, Connor T. and Johansson, Richard},
  booktitle={Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2022},
  pages={5638--5645}
}
