Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records

Brian Libgober | Connor Jerzak


Motivation

Linking records is a daily task for many social scientists studying organizations, but it is beset by challenges. What if one dataset refers to an organization one way and another dataset refers to the same organization differently? Consider a researcher trying to link “MIT” with “Massachusetts Institute of Technology”. For some well-known organizations, human coders may be able to find such matches. But what about less prominent organizations? And what if a researcher’s dataset has thousands or even millions of such entries to match?

Paper Contributions

In this paper, we introduce the LinkedIn employment network as a valuable resource for those seeking to perform merge tasks involving organizations. We develop three kinds of methods by which the half-a-billion open-collaborated LinkedIn records can assist record linkage on organizations. The first uses machine learning, treating the LinkedIn corpus as a massive training set of which organizational names are used to refer to which other organizational names. The second consists of two graph-theoretic approaches that treat the LinkedIn name references as a gigantic network summarized via clustering. The third combines the machine learning approach with the graph-theoretic perspective in a unified method. We evaluate performance on several real-world organizational matching tasks, which illustrate the potential of the LinkedIn-assisted methods in practice.
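To make the graph-theoretic idea concrete, here is a minimal sketch, not the paper's method or the package's API. It assumes a small hand-made list of alias pairs (hypothetical stand-ins for LinkedIn name references) and uses connected components via union-find as the simplest possible clustering. The helper names `normalize`, `UnionFind`, `build_clusters`, and `link` are invented for illustration.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings coincide."""
    return " ".join(name.lower().split())


class UnionFind:
    """Disjoint-set structure; each set is one cluster of organization names."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)  # unseen names start as singletons
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_clusters(alias_pairs):
    """Treat each alias pair as an edge and merge the names it connects."""
    uf = UnionFind()
    for a, b in alias_pairs:
        uf.union(normalize(a), normalize(b))
    return uf


def link(names_x, names_y, uf):
    """Pair names from two datasets whenever they fall in the same cluster."""
    by_cluster = defaultdict(list)
    for y in names_y:
        by_cluster[uf.find(normalize(y))].append(y)
    return [
        (x, y)
        for x in names_x
        for y in by_cluster.get(uf.find(normalize(x)), [])
    ]


# Hypothetical alias pairs, standing in for name references in the network.
alias_pairs = [
    ("MIT", "Massachusetts Institute of Technology"),
    ("Massachusetts Institute of Technology", "Mass. Inst. of Tech."),
    ("IBM", "International Business Machines"),
]
uf = build_clusters(alias_pairs)

names_x = ["MIT", "IBM", "Acme Corp"]
names_y = ["Mass. Inst. of Tech.", "International Business Machines", "acme corp", "Globex"]
matches = link(names_x, names_y, uf)
print(matches)
```

Here “MIT” and “Mass. Inst. of Tech.” never appear together in an alias pair, yet they link because both connect to the full name. Connected components are a deliberately crude choice: a single noisy edge merges two clusters entirely, which is one reason a real application would likely want weighted edges and a more robust clustering step.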

We make all methods available as an open-source package at github.com/cjerzak/LinkOrgs-software. Training data are also available through the package interface for those interested in building upon our methods. We’re also considering a new project that would use the trillions of name-match examples on the network to improve personal name matching via Transformer-based neural models.

References

Brian Libgober, Connor T. Jerzak. Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records. arXiv preprint, 2023.
@article{libgober2023linking,
  title={Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records},
  author={Libgober, Brian and Jerzak, Connor T.},
  journal={arXiv preprint},
  year={2023}
}