Knowledge-based Document Classification with Shannon Entropy

06/06/2022
by AtMa P. O. Chan, et al.

Document classification is the detection of specific content of interest in text documents. In contrast to data-driven machine learning classifiers, knowledge-based classifiers can be constructed from domain-specific knowledge, which usually takes the form of a collection of subject-related keywords. Typical knowledge-based classifiers compute a prediction score based on keyword abundance, but they generally suffer from noisy detections because they lack a guiding principle for gauging the keyword matches. In this paper, we propose a novel knowledge-based model equipped with Shannon entropy, which measures the richness of information and favors uniform and diverse keyword matches. Because it does not require any positive samples, the method provides a simple and explainable solution for document classification. We show that Shannon entropy significantly improves recall at a fixed false positive rate. We also show that the model is more robust to changes in the data distribution at inference than traditional machine learning, particularly when positive training samples are very limited.
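
To illustrate why entropy favors uniform and diverse keyword matches, here is a minimal Python sketch. It is not the authors' model: the abstract does not give the scoring formula, so the functions `keyword_counts` and `shannon_entropy`, the keyword list, and the two sample documents are all hypothetical. The sketch assumes single-word keywords matched case-insensitively and computes the Shannon entropy of the normalized keyword match counts.

```python
import math
import re
from collections import Counter


def keyword_counts(text, keywords):
    """Count case-insensitive whole-word matches of each keyword (assumed single words)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    token_counts = Counter(tokens)
    return {kw: token_counts[kw.lower()] for kw in keywords}


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution of matches over keywords."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no matches at all: treat as zero information
    entropy = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical keyword list and documents, for illustration only.
keywords = ["contract", "liability", "indemnity", "breach"]
repetitive = "contract contract contract contract"
diverse = "contract liability indemnity breach"

# Both documents contain four keyword matches, so a pure abundance
# score cannot separate them; the entropy of the matches can.
h_repetitive = shannon_entropy(keyword_counts(repetitive, keywords))
h_diverse = shannon_entropy(keyword_counts(diverse, keywords))
print(h_repetitive, h_diverse)
```

In this toy case the repetitive document scores zero entropy while the diverse one reaches the maximum of log2(4) = 2 bits, even though the keyword abundance is identical. How the paper combines entropy with abundance into a final prediction score is not stated in the abstract and is left open here.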


research
02/25/2021

New identities for the Shannon function and applications

We show how the Shannon entropy function H(p,q) is expressible as a linea...
research
08/10/2018

A Hassle-Free Machine Learning Method for Cohort Selection of Clinical Trials

Traditional text classification techniques in clinical domain have heavi...
research
04/18/2022

AB/BA analysis: A framework for estimating keyword spotting recall improvement while maintaining audio privacy

Evaluation of keyword spotting (KWS) systems that detect keywords in spe...
research
05/24/2022

Boosting Tail Neural Network for Realtime Custom Keyword Spotting

In this paper, we propose a Boosting Tail Neural Network (BTNN) for impr...
research
02/09/2021

CNN Application in Detection of Privileged Documents in Legal Document Review

Protecting privileged communications and data from disclosure is paramou...
research
04/12/2021

Deep learning using Havrda-Charvat entropy for classification of pulmonary endomicroscopy

Pulmonary optical endomicroscopy (POE) is an imaging technology in real ...
research
04/20/2018

Benchmarking Top-K Keyword and Top-K Document Processing with T^2K^2 and T^2K^2D^2

Top-k keyword and top-k document extraction are very popular text analys...
