Using Titles vs. Full-text as Source for Automated Semantic Document Annotation

05/15/2017
by Lukas Galke, et al.

A significant part of the largest Knowledge Graph today, the Linked Open Data cloud, consists of metadata about documents such as publications, news reports, and other media articles. While widespread access to document metadata is a tremendous advancement, it is not yet easy to assign semantic annotations and organize the documents along semantic concepts. Providing semantic annotations such as concepts from SKOS thesauri is a classical research topic, but it is typically conducted on the full-text of the documents. For the first time, we offer a systematic comparison of classification approaches to investigate how far semantic annotation can be conducted using just the metadata of the documents, such as titles published as labels on the Linked Open Data cloud. We compare the classifications obtained from analyzing the documents' titles with semantic annotations obtained from analyzing the full-text. Apart from the prominent text classification baselines kNN and SVM, we also compare recent techniques of Learning to Rank and neural networks, and revisit the traditional methods logistic regression, Rocchio, and Naive Bayes. The results show that on three of our four datasets, the performance of the classifications using only titles reaches over 90% of the classification performance obtained when using the full-text. Thus, conducting document classification using just the titles is a reasonable approach for automated semantic annotation and opens up new possibilities for enriching Knowledge Graphs.
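To make the headline comparison concrete, the following minimal sketch shows how a title-based score can be expressed as a fraction of a full-text score. The abstract does not name the evaluation metric, so the use of sample-averaged F1 here is an assumption, and the function names `sample_f1` and `relative_performance` as well as the toy annotations are hypothetical illustrations, not data or code from the paper.

```python
def sample_f1(gold, predicted):
    """Sample-averaged F1: compute F1 per document, then average.

    Each element of `gold` and `predicted` is a set of concept labels.
    (Assumed metric; the abstract does not state which one is used.)
    """
    scores = []
    for g, p in zip(gold, predicted):
        overlap = len(g & p)
        if overlap == 0:
            # No correct label (or nothing predicted): F1 is 0 for this document.
            scores.append(0.0)
            continue
        precision = overlap / len(p)
        recall = overlap / len(g)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


def relative_performance(title_score, fulltext_score):
    """Title-based score as a fraction of the full-text score."""
    return title_score / fulltext_score


# Made-up toy annotations for three documents (illustration only).
gold = [{"economics", "trade"}, {"health"}, {"politics", "law"}]
fulltext_pred = [{"economics", "trade"}, {"health"}, {"politics"}]
title_pred = [{"economics"}, {"health"}, {"politics"}]

f1_full = sample_f1(gold, fulltext_pred)
f1_title = sample_f1(gold, title_pred)
ratio = relative_performance(f1_title, f1_full)
print(f"full-text F1={f1_full:.3f}, title F1={f1_title:.3f}, ratio={ratio:.3f}")
```

On this toy input the title-based annotations reach 87.5% of the full-text score; the "over 90%" statement in the abstract is a ratio of this kind, computed on the paper's own datasets and classifiers.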

research
04/11/2023

SciKGTeX – A LaTeX Package to Semantically Annotate Contributions in Scientific Publications

Scientific knowledge graphs have been proposed as a solution to structur...
research
03/11/2020

ConceptScope: Organizing and Visualizing Knowledge in Documents based on Domain Ontology

Current text visualization techniques typically provide overviews of doc...
research
02/17/2019

Multiple Document Representations from News Alerts for Automated Bio-surveillance Event Detection

Due to globalization, geographic boundaries no longer serve as effective...
research
05/04/2023

Tuning Traditional Language Processing Approaches for Pashto Text Classification

Today text classification becomes critical task for concerned individual...
research
05/01/2020

Minimally Supervised Categorization of Text with Metadata

Document categorization, which aims to assign a topic label to each docu...
research
11/09/2022

DoSA : A System to Accelerate Annotations on Business Documents with Human-in-the-Loop

Business documents come in a variety of structures, formats and informat...
research
01/20/2018

Using Deep Learning For Title-Based Semantic Subject Indexing To Reach Competitive Performance to Full-Text

For (semi-)automated subject indexing systems in digital libraries, it i...
