Comparative Study of Long Document Classification

11/01/2021
by Vedangi Wagh, et al.
The amount of information stored as documents on the internet has been increasing rapidly, so it has become necessary to organize and maintain these documents efficiently. Text classification algorithms study the complex relationships between words in a text and try to interpret the semantics of the document. These algorithms have evolved significantly in the past few years, progressing from simple machine learning algorithms to transformer-based architectures. However, the existing literature evaluates different approaches on different datasets, which makes it difficult to compare the performance of machine learning algorithms. In this work, we revisit long document classification using standard machine learning approaches. We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets and present an exhaustive comparison of the algorithms across a range of long document datasets. We reiterate that long document classification is a comparatively simple task: even basic algorithms perform competitively with BERT-based approaches on most of the datasets. The BERT-based models perform consistently well on all the datasets and can be used as a default choice for document classification when computational cost is not a concern. Among the shallow models, we suggest the raw BiLSTM + Max architecture, which performs decently across all the datasets. The even simpler GloVe + Attention bag-of-words model can be used for simpler use cases. The importance of sophisticated models is clearly visible on the IMDB sentiment dataset, which is a comparatively harder task.
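To make the simplest end of the benchmarked range concrete, the following is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier. It is not the authors' implementation: the class name `MultinomialNaiveBayes`, the whitespace tokenization, the add-one smoothing, and the four toy training sentences are all assumptions chosen for illustration, and the toy data bears no relation to the six datasets used in the paper.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Bag-of-words multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # Count documents per class (for priors) and word occurrences per class.
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()  # assumption: naive whitespace tokenizer
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
train_docs = [
    "the film was wonderful and moving",
    "a dull and tedious film",
    "wonderful acting and a moving story",
    "tedious plot and dull acting",
]
train_labels = ["pos", "neg", "pos", "neg"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("a moving and wonderful story")
```

Working in log space avoids numerical underflow, which matters for long documents because the score multiplies one probability per token. A baseline of this kind ignores word order entirely, which is one plausible reason it may fall behind on sentiment tasks such as IMDB.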

research
01/18/2022

Hierarchical Neural Network Approaches for Long Document Classification

Text classification algorithms investigate the intricate relationships b...
research
07/18/2023

Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study

Text classification is an area of research which has been studied over t...
research
04/14/2022

Revisiting Transformer-based Models for Long Document Classification

The recent literature in text classification is biased towards short tex...
research
07/05/2017

The Influence of Feature Representation of Text on the Performance of Document Classification

In this paper we perform a comparative analysis of three models for feat...
research
09/20/2015

Early text classification: a Naive solution

Text classification is a widely studied problem, and it can be considere...
research
02/27/2018

Convolutional Neural Networks for Toxic Comment Classification

Flood of information is produced in a daily basis through the global Int...
research
02/17/2017

Analysis and Optimization of fastText Linear Text Classifier

The paper [1] shows that simple linear classifier can compete with compl...
