Method for Customizable Automated Tagging: Addressing the Problem of Over-tagging and Under-tagging Text Documents

04/30/2020
by Maharshi R. Pandya, et al.

Using author-provided tags to predict tags for a new document often results in the overgeneration of tags. When the author provides no tags at all, documents are severely under-tagged. In this paper, we present a method to generate a universal set of tags that can be applied widely to a large document corpus. Using IBM Watson's NLU service, we first collect keywords and phrases, which we call "complex document tags", from 8,854 popular reports in the corpus. We then apply an LDA model over these complex document tags to generate a set of 765 unique "simple tags". To apply the tags to a corpus of documents, we run each document through IBM Watson NLU and assign the appropriate simple tags. Using only 765 simple tags, our method tags 87,397 of the 88,583 documents in the corpus with at least one tag. About 92.1% are sufficiently tagged. Finally, we discuss the performance of our method and its limitations.
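As a rough illustration of the final tagging step and the coverage figure quoted above, the following minimal Python sketch assigns simple tags to documents and computes the share of documents that receive at least one tag. It is not the authors' implementation: the word-overlap matching rule, the function names `apply_simple_tags` and `coverage`, and the toy keyword lists are assumptions made for this example, and the IBM Watson NLU and LDA steps are not reproduced.

```python
from typing import Dict, List, Set


def apply_simple_tags(keywords: List[str], simple_tags: Set[str]) -> Set[str]:
    """Return the simple tags that occur as a word in any keyword phrase.

    Assumption: a simple tag applies when it matches a word of a
    "complex document tag"; the paper's actual rule may differ.
    """
    words = {word for phrase in keywords for word in phrase.lower().split()}
    return words & simple_tags


def coverage(tagged: int, total: int) -> float:
    """Percentage of documents that received at least one tag."""
    return 100.0 * tagged / total


# Hypothetical data standing in for the 765 simple tags and for the
# keywords that a keyword extractor would return for each document.
simple_tags: Set[str] = {"energy", "finance", "climate"}
docs: Dict[str, List[str]] = {
    "doc_a": ["renewable energy policy", "climate risk"],
    "doc_b": ["quarterly finance report"],
    "doc_c": ["untitled draft"],
}

assigned = {doc_id: apply_simple_tags(kws, simple_tags) for doc_id, kws in docs.items()}
tagged_count = sum(1 for tags in assigned.values() if tags)

print(f"toy coverage: {coverage(tagged_count, len(docs)):.1f}%")
# Coverage implied by the counts in the abstract (87,397 of 88,583).
print(f"reported coverage: {coverage(87_397, 88_583):.2f}%")
```

With the counts from the abstract, the second figure comes to roughly 98.66%, so about 1,186 documents receive no simple tag at all.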


