A novel text representation which enables image classifiers to perform text classification, applied to name disambiguation

08/19/2019
by   Stephen M. Petrie, et al.

Patent data are often used to study the process of innovation and research, but patent databases lack unique identifiers for individual inventors, making it difficult to study innovation processes at the individual level. Here we introduce an algorithm that performs highly accurate disambiguation of inventors (named entities) in US patent data (F1: 99.09, recall: 98.76). The algorithm converts text-based record data into abstract image representations: the text from each pairwise comparison between two inventor name records is converted into a 2D RGB (stacked) image. We train an image classification neural network to discriminate between such pairwise comparison images, and then use the trained network to label each pair of records as either matched (same inventor) or non-matched (different inventors). The resulting disambiguation algorithm outperforms other inventor name disambiguation studies on US patent data. Our new text-to-image representation method could potentially be used more broadly for other NLP comparison problems, as it allows image-based processing techniques (e.g. image classification networks) to be applied to text-based comparison problems (such as disambiguation of academic publications, or data linkage problems).
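To make the idea of a stacked pairwise comparison image concrete, here is a minimal Python sketch. The abstract does not specify how text is mapped to pixels, so everything below is an assumption made for illustration: the function names `record_to_channel` and `pair_to_image`, the 16x16 image size, the byte-value pixel encoding, and the choice to put one record in the red channel, the other in the green channel, and an equality mask in the blue channel. It should not be read as the authors' encoding.

```python
import numpy as np


def record_to_channel(text: str, side: int = 16) -> np.ndarray:
    """Encode a string as a side x side grid of byte values (hypothetical scheme).

    The UTF-8 bytes of the text fill the grid row by row; unused cells stay 0
    and text longer than side * side bytes is truncated.
    """
    raw = text.encode("utf-8")[: side * side]
    grid = np.zeros(side * side, dtype=np.uint8)
    grid[: len(raw)] = np.frombuffer(raw, dtype=np.uint8)
    return grid.reshape(side, side)


def pair_to_image(record_a: str, record_b: str, side: int = 16) -> np.ndarray:
    """Stack two encoded records into one RGB image (assumed layout).

    Red holds record A, green holds record B, and blue is 255 wherever the
    two encodings agree and 0 elsewhere.
    """
    red = record_to_channel(record_a, side)
    green = record_to_channel(record_b, side)
    blue = np.where(red == green, 255, 0).astype(np.uint8)
    return np.stack([red, green, blue], axis=-1)


# Example pair of made-up inventor name records.
image = pair_to_image("smith john a | boston ma", "smith j | boston ma")
print(image.shape, image.dtype)
```

An array of this shape can be passed to any standard image classification network, which would then be trained on labelled matched and non-matched pairs as the abstract describes.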


research
05/04/2023

Image Captioners Sometimes Tell More Than Images They See

Image captioning, a.k.a. "image-to-text," which generates descriptive te...
research
04/03/2023

Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach

Pain is a common reason for accessing healthcare resources and is a grow...
research
05/08/2020

Comparative Analysis of Text Classification Approaches in Electronic Health Records

Text classification tasks which aim at harvesting and/or organizing info...
research
04/01/2020

An Improved Classification Model for Igbo Text Using N-Gram And K-Nearest Neighbour Approaches

This paper presents an improved classification model for Igbo text using...
research
08/31/2018

Seeing Colors: Learning Semantic Text Encoding for Classification

The question we answer with this work is: can we convert a text document...
research
11/08/2018

Doc2Im: document to image conversion through self-attentive embedding

Text classification is a fundamental task in NLP applications. Latest re...
research
10/04/2011

Identifying relationships between drugs and medical conditions: winning experience in the Challenge 2 of the OMOP 2010 Cup

There is a growing interest in using a longitudinal observational databa...
