An Improved Method of Automated Nonparametric Content Analysis (“readme2”)

Text data is an increasingly common source of information for social scientists and others seeking to quantify the balance of opinions in a set of documents such as tweets, legislative speeches, or blog posts. Given a set of documents that researchers have already labeled, what is the overall distribution of opinion labels in the unlabeled test set?

This question is particularly relevant when researchers want to extrapolate from past text to future text without manually reading new documents and sorting them into categories (such as positive, neutral, or negative sentiment toward a politician). This distribution prediction task, known as quantification (as opposed to document-level classification), is the focus of this project.

Our method builds on prior work that estimates the test-set distribution of document categories from the training set by applying the Law of Total Probability.
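As a rough illustration of that idea (a minimal sketch, not the authors' implementation), the Law of Total Probability says that the probability of each feature profile in the unlabeled set is a mixture of the class-conditional profile probabilities, weighted by the unknown category proportions. If the class-conditional probabilities are estimated from the labeled set, the proportions can be recovered by solving a small linear system. The function name, the integer coding of feature profiles, and the clip-and-renormalize step are all assumptions made for this example.

```python
import numpy as np

def estimate_proportions(S_labeled, y_labeled, S_unlabeled, n_classes):
    """Estimate category proportions in the unlabeled set (hypothetical helper).

    S_labeled, S_unlabeled: integer ids of discrete feature profiles.
    y_labeled: integer category labels in [0, n_classes).
    Assumes every category appears at least once in the labeled set.
    """
    n_profiles = int(max(S_labeled.max(), S_unlabeled.max())) + 1

    # P(S = s | D = d), estimated from labeled documents; one column per category.
    cond = np.zeros((n_profiles, n_classes))
    for d in range(n_classes):
        counts = np.bincount(S_labeled[y_labeled == d], minlength=n_profiles)
        cond[:, d] = counts / counts.sum()

    # P(S = s), observed directly in the unlabeled documents.
    marg = np.bincount(S_unlabeled, minlength=n_profiles) / len(S_unlabeled)

    # Law of Total Probability: marg = cond @ pi. Solve for pi by least squares,
    # then clip and renormalize so the estimate is a valid distribution
    # (a crude stand-in for a properly constrained solver).
    pi, *_ = np.linalg.lstsq(cond, marg, rcond=None)
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()

# Toy data: the labeled set is split 50/50 between two categories,
# while the unlabeled set was built with proportions (0.25, 0.75).
S_lab = np.array([0] * 8 + [1] * 2 + [0] * 3 + [1] * 7)
y_lab = np.array([0] * 10 + [1] * 10)
S_unl = np.array([0] * 425 + [1] * 575)
pi_hat = estimate_proportions(S_lab, y_lab, S_unl, n_classes=2)
print(pi_hat)
```

Note that no individual unlabeled document is classified here: only the aggregate distribution is estimated, which is what distinguishes quantification from classification.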

We derived the analytical bias of this kind of method in a more general setting, where the text features can be either discrete or continuous, using the Law of Total Expectation. We then used that bias expression to optimize the features used in calculating the test-set distribution, specifically to minimize estimation bias.
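For reference, the identity being invoked can be written as follows. The notation is an assumption made for this note and may differ from the paper's: S is the feature vector of a document in the unlabeled set, D is its category, C is the number of categories, and \pi_c is the unlabeled-set proportion of category c.

```latex
\mathbb{E}[S] \;=\; \sum_{c=1}^{C} \mathbb{E}[S \mid D = c]\,\pi_c
```

The left-hand side can be estimated from the unlabeled documents and the conditional means from the labeled documents, leaving \pi as the unknown. If the conditional means differ between the labeled and unlabeled sets, the resulting estimate of \pi is biased, which is why the choice of features matters.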

The paper also introduces a text matching step, borrowed from causal inference, that relaxes some of the assumptions of prior methods, allowing text to differ substantially between the labeled and unlabeled sets while still permitting robust estimation. We also outline other ways in which language can change over time that pose problems even for this improved methodological tool.
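To make the matching idea concrete, here is a minimal sketch under stated assumptions: documents are already represented as numeric feature vectors, and a labeled document is kept only if it is among the k nearest labeled neighbours of at least one unlabeled document. This is plain nearest-neighbour pruning with Euclidean distance, not necessarily the paper's exact procedure; the function name and the choice of k are hypothetical.

```python
import numpy as np

def match_labeled_to_unlabeled(X_labeled, X_unlabeled, k=3):
    """Return sorted indices of labeled documents that are among the k nearest
    labeled neighbours of at least one unlabeled document (hypothetical helper)."""
    # Pairwise squared Euclidean distances, shape (n_unlabeled, n_labeled).
    diffs = X_unlabeled[:, None, :] - X_labeled[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)

    # For each unlabeled document, indices of its k closest labeled documents.
    nearest = np.argsort(dist2, axis=1)[:, :k]

    # Union of all matched labeled documents; np.unique also sorts them.
    return np.unique(nearest)

# Toy data: the last labeled document is far from every unlabeled document.
X_lab = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 10.0]])
X_unl = np.array([[0.0, 0.1], [5.0, 5.1]])
kept = match_labeled_to_unlabeled(X_lab, X_unl, k=2)
print(kept)
```

Only the retained labeled documents would then be used to estimate the class-conditional feature distributions, so that labeled text resembling nothing in the unlabeled set does not distort the estimate.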

Overall, the project improves on existing methods for quantification, helping researchers better understand the balance of opinion in text data that may be changing over time by representing that text in an explicitly optimized numerical form.


Connor T. Jerzak, Gary King, Anton Strezhnev. An Improved Method of Automated Nonparametric Content Analysis for Social Science. Political Analysis, 31(1): 42-58, 2023.

[Use software]

Related Work

Adel Daoud, Connor T. Jerzak, Richard Johansson. Conceptualizing Treatment Leakage in Text-based Causal Inference. NAACL, 2022.

