LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models

09/02/2023
by Abhishek Arora, et al.

Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages. Our open-source package LinkTransformer aims to extend the familiarity and ease of use of popular string matching methods to deep learning. It is a general-purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
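To illustrate the kind of "four lines of code" workflow the abstract describes, here is a minimal sketch of a semantic merge between two noisy dataframes. The entry-point name `lt.merge`, its arguments, and the example data and model id are assumptions for illustration, not verified against the released API; consult the package documentation for the authoritative signatures.

```python
# Minimal sketch of record linkage with LinkTransformer.
# NOTE: lt.merge and its arguments are assumed for illustration;
# check the LinkTransformer docs for the exact API.
import pandas as pd
import linktransformer as lt

# Two noisy datasets that share a company-name field.
df1 = pd.DataFrame({"CompanyName": ["Intl. Business Machines", "Micro Soft Corp"]})
df2 = pd.DataFrame({"CompanyName": ["International Business Machines", "Microsoft Corporation"]})

# Semantic merge using a pre-trained transformer similarity model
# (any Hugging Face model id could be passed here).
merged = lt.merge(df1, df2, on="CompanyName",
                  model="sentence-transformers/all-MiniLM-L6-v2")
print(merged.head())
```

The appeal of this interface is that it mirrors the dataframe-merge idiom familiar from R and Stata string matching packages, while the heavy lifting (embedding records and retrieving nearest matches) happens inside the transformer model.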
