TextZoo, a New Benchmark for Reconsidering Text Classification

02/10/2018 ∙ by Benyou Wang, et al.

Text representation is a fundamental concern in Natural Language Processing, especially in text classification. Recently, many neural network approaches with delicate representation models (e.g. FastText, CNN, RNN and many hybrid models with attention mechanisms) have claimed state-of-the-art results on specific text classification datasets. However, the field lacks a unified benchmark that compares these models and reveals the advantage of each sub-component in various settings. We re-implement more than 20 popular text representation models for classification on more than 10 datasets. In this paper, we reconsider the text classification task from the perspective of neural networks and draw several observations from an analysis of the above results.




1 Introduction

In the Natural Language Processing and text-related communities, effective representation of textual sequences is a fundamental topic for downstream tasks. Traditionally, bag-of-words models (TF-IDF or language models) with a vocabulary-aware vector space have been the mainstream approach, especially for tasks with long text (e.g. ad hoc retrieval over long documents, text classification of long sentences). However, they tend to perform poorly on tasks with short text (text classification of relatively short sentences, question answering, machine comprehension and dialogue systems), in which there is little word-level overlap in the bag-of-words vector space. Distributed representations (Le and Mikolov, 2014) in a fixed low-dimensional space, trained on large-scale corpora, have been proposed to enhance the features of text and break through the performance bottleneck of bag-of-words models in short-text tasks. By combining Convolutional Neural Networks (CNN) (Kalchbrenner et al., 2014), Recurrent Neural Networks (RNN), Recursive Neural Networks (Socher et al., 2013) and attention, hundreds of models have been proposed to model text for classification, matching (Fan et al., 2017) or other tasks.
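The word-overlap problem for short text can be illustrated with a minimal sketch. The sentences below are made-up examples and the helper names `bow_vector` and `cosine` are hypothetical; this is not code from the benchmark.

```python
import math
from collections import Counter


def bow_vector(text):
    """Map a text to a sparse bag-of-words count vector (word -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two short sentences with similar meaning but no shared words.
s1 = "great film"
s2 = "excellent movie"

# Two longer texts on the same topic share many words.
d1 = "the film has a great plot and the cast in the film is great"
d2 = "the plot of the film is weak but the cast is great"

short_sim = cosine(bow_vector(s1), bow_vector(s2))  # no overlap at all
long_sim = cosine(bow_vector(d1), bow_vector(d2))   # substantial overlap

print("short:", short_sim)
print("long:", round(long_sim, 3))
```

In the bag-of-words space the two short sentences are orthogonal even though they mean nearly the same thing, whereas a dense embedding space can place "film" close to "movie" and "great" close to "excellent".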

However, these models are tested in different settings, with varying datasets, preprocessing and even evaluation procedures. Since subtle differences may lead to large divergence in final performance, it is essential to obtain a robust comparison backed by rigorous significance tests. Moreover, by the No-Free-Lunch principle, no single model can be expected to be both the most effective and the most efficient in every setting. Thus each model should be considered in terms of the trade-off between its effectiveness and efficiency.
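One way to test whether two models differ significantly on the same test set is a paired randomization test over per-example correctness. The sketch below is an illustration under that assumption; the text does not specify which significance test the benchmark uses, and `paired_permutation_test` is a hypothetical helper name.

```python
import random


def paired_permutation_test(correct_a, correct_b, n_resamples=10000, seed=0):
    """Two-sided paired randomization test on per-example correctness.

    correct_a[i] and correct_b[i] are 1 if model A / model B classified
    test example i correctly, else 0. Returns an approximate p-value for
    the null hypothesis that the two models are equally accurate.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same examples")
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, swapping the two models' outputs on any example
        # is equally likely, which flips the sign of that difference.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an exact zero.
    return (extreme + 1) / (n_resamples + 1)


# Synthetic example: model A is right on 60 examples where model B is wrong,
# and both are right on the remaining 40.
model_a = [1] * 100
model_b = [0] * 60 + [1] * 40
p_value = paired_permutation_test(model_a, model_b)
print("p-value:", p_value)
```

Because the test is paired, it only requires that both models are evaluated on exactly the same test examples, which a shared benchmark guarantees.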

Our contributions are:

  1. A new open-source benchmark for text classification with more than 20 models and 10 datasets. The code is available at https://github.com/wabyking/TextClassificationBenchmark.

  2. A systematic reconsideration of text classification as a trade-off between effectiveness and efficiency.

2 Models

The models are as follows:

FastText (Joulin et al., 2016). Sum over all the input embeddings.

LSTM. Basic LSTM (Hochreiter and Schmidhuber, 1997) over the input embedding sequence.

BiLSTM. LSTM run in both the forward and backward directions.

StackLSTM. LSTM with multiple layers.

Basic CNN. Convolution over the input embedding (Kalchbrenner et al., 2014).

Multi-window CNN (Severyn and Moschitti, 2015). Pads the input embeddings to a fixed size and concatenates the feature maps after convolution.

Multi-layer CNN. CNN with multiple layers for high-level modelling.

CNN with Inception. CNN with Inception mechanism (Szegedy et al., 2015).

Capsules. CNN with Capsule Networks (Sabour et al., 2017).

Quantum-inspired CNN. Neural representation inspired by quantum theory (Zhang et al., 2018; Niu et al., 2017).

RCNN (Lai et al., 2015). LSTM with pooling mechanism.

CRNN (Zhou et al., 2015). CNN applied after LSTM.
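To make two of the simpler entries concrete, the following is a minimal pure-Python sketch of sum pooling over input embeddings and of a multi-window convolution over a padded input. It is not the benchmark's implementation: the toy embeddings, the filter weights, the ReLU, and the max-over-time pooling step are all assumptions made for illustration, and the function names are hypothetical.

```python
def sum_pooling(embeddings):
    """Sentence vector as the element-wise sum of all input embeddings."""
    dim = len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) for d in range(dim)]


def pad_to_length(embeddings, length):
    """Truncate or zero-pad the embedding sequence to a fixed length."""
    dim = len(embeddings[0])
    padded = [list(vec) for vec in embeddings[:length]]
    padded += [[0.0] * dim for _ in range(length - len(padded))]
    return padded


def conv_max_pool(embeddings, filt):
    """Slide one filter over the sequence, apply ReLU, then max-over-time.

    filt is a list of weight vectors, one per position in the window,
    so len(filt) is the window size.
    """
    window = len(filt)
    scores = []
    for start in range(len(embeddings) - window + 1):
        score = 0.0
        for offset in range(window):
            vec = embeddings[start + offset]
            score += sum(w * x for w, x in zip(filt[offset], vec))
        scores.append(max(score, 0.0))  # ReLU
    return max(scores)


def multi_window_cnn(embeddings, filters, length):
    """Concatenate pooled features from filters of different window sizes."""
    padded = pad_to_length(embeddings, length)
    return [conv_max_pool(padded, f) for f in filters]


# Toy 2-dimensional embeddings for a 3-token sentence (made-up values).
sentence = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

# One filter with window size 1 and one with window size 2 (made-up weights).
filters = [
    [[1.0, -1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]

summed = sum_pooling(sentence)
features = multi_window_cnn(sentence, filters, length=4)
print(summed)
print(features)
```

In a trained model the resulting vector would be passed to a classifier layer, and the embeddings and filter weights would be learned rather than fixed by hand.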

3 Dataset

The datasets are shown in Tab. 1.

Dataset        Label   Vocab.   Train   Test
20Newsgroups   -       -        -       -
SST            -       -        -       -
IMDB           2       -        25000   25000
SST-1          5       -        8544    2210
SST-2          2       -        6920    1821
SUBJ           2       -        9000    1000
Table 1: Datasets. A dash marks a value not reported. 20Newsgroups: qwone.com/jason/20Newsgroups/. SST: nlp.stanford.edu/sentiment/.

3.1 Evaluation

We adopt precision as the final evaluation metric, which is widely used in classification tasks.
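As a minimal sketch of the metric, precision for a class is the fraction of examples predicted as that class that truly belong to it. The text does not say how per-class values are aggregated, so the macro average below is an assumption, and the function names and labels are hypothetical.

```python
def precision_per_class(y_true, y_pred):
    """Per-class precision: true positives / all predictions of that class."""
    labels = sorted(set(y_true) | set(y_pred))
    result = {}
    for label in labels:
        predicted = sum(1 for p in y_pred if p == label)
        true_pos = sum(
            1 for t, p in zip(y_true, y_pred) if p == label and t == label
        )
        # A class that is never predicted gets precision 0.0 by convention here.
        result[label] = true_pos / predicted if predicted else 0.0
    return result


def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precisions (an assumed aggregation)."""
    per_class = precision_per_class(y_true, y_pred)
    return sum(per_class.values()) / len(per_class)


# Made-up gold labels and predictions for a binary task.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

per_class = precision_per_class(y_true, y_pred)
macro = macro_precision(y_true, y_pred)
print(per_class)
print(round(macro, 4))
```

Here class 0 is predicted twice with one hit and class 1 is predicted three times with two hits, so the per-class precisions are 1/2 and 2/3.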

4 Conclusion

As claimed in the introduction, a benchmark for text classification has been proposed to systematically compare these state-of-the-art models. Performance results, significance tests, an effectiveness-efficiency discussion, case studies, a comparison between RNN and CNN, and an embedding sensitivity analysis remain to be done.