Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles

03/22/2020
by Malte Ostendorff, et al.

Many digital libraries recommend literature to their users based on the similarity between a query document and the documents in their repository. However, they often fail to distinguish which relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT to be the best-performing system, with an F1-score of 0.93; we manually examine its results to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries, such that one could find documents that are similar in one aspect but dissimilar in another.
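
To make the idea of a "vector concatenation scheme" concrete, the sketch below shows one way two document vectors could be combined into a single feature vector for a pairwise classifier. It assumes each document has already been encoded as a fixed-length vector (for example with GloVe or Paragraph-Vectors). The scheme names, the function `pair_features`, and the specific combinations are illustrative assumptions, not the configurations reported in the paper.

```python
from typing import Callable, Dict, List

Vector = List[float]


def _concat(u: Vector, v: Vector) -> Vector:
    # Plain concatenation: [u; v]
    return u + v


def _concat_abs_diff(u: Vector, v: Vector) -> Vector:
    # Concatenation plus element-wise absolute difference: [u; v; |u - v|]
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def _concat_product(u: Vector, v: Vector) -> Vector:
    # Concatenation plus element-wise product: [u; v; u * v]
    return u + v + [a * b for a, b in zip(u, v)]


# Hypothetical scheme registry; the names are chosen for this example only.
SCHEMES: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "concat": _concat,
    "concat_abs_diff": _concat_abs_diff,
    "concat_product": _concat_product,
}


def pair_features(u: Vector, v: Vector, scheme: str = "concat") -> Vector:
    """Combine two document vectors into one feature vector for a pair."""
    if len(u) != len(v):
        raise ValueError("document vectors must have the same length")
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")
    return SCHEMES[scheme](u, v)


if __name__ == "__main__":
    doc_a = [1.0, 2.0]  # toy stand-ins for real document embeddings
    doc_b = [3.0, 5.0]
    for name in SCHEMES:
        print(name, pair_features(doc_a, doc_b, name))
```

The resulting feature vector would then be passed to a multi-class classifier whose labels are the relation types, which in the dataset described above are Wikidata properties.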

research
10/13/2020

Aspect-based Document Similarity for Research Papers

Traditional document similarity measures provide a coarse-grained distin...
research
08/01/2020

Contextual Document Similarity for Content-based Literature Recommender Systems

To cope with the ever-growing information overload, an increasing number...
research
03/28/2022

Specialized Document Embeddings for Aspect-based Similarity of Research Papers

Document embeddings and similarity measures underpin content-based recom...
research
10/17/2020

Learning from similarity and information extraction from structured documents

Neural networks have successfully advanced in the task of information ex...
research
07/13/2017

Classifying document types to enhance search and recommendations in digital libraries

In this paper, we address the problem of classifying documents available...
research
11/23/2019

SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals

In response to the continuing research interest in computational semanti...
research
12/08/2017

A Method for Finding Similar Documents Relying on Adding Repetition of Symbols in Length Based Filtering

A basic topic in mining of massive dataset is finding similar items. As ...
