DocBERT: BERT for Document Classification

04/17/2019 · by Ashutosh Adhikari et al., University of Waterloo

Pre-trained language representation models achieve remarkable state of the art across a wide range of tasks in natural language processing. One of the latest advancements is BERT, a deep pre-trained transformer that yields much better results than its predecessors do. Despite its burgeoning popularity, however, BERT has not yet been applied to document classification. This task deserves attention, since it contains a few nuances: first, modeling syntactic structure matters less for document classification than for other problems, such as natural language inference and sentiment classification. Second, documents often have multiple labels across dozens of classes, which is uncharacteristic of the tasks that BERT explores. In this paper, we describe fine-tuning BERT for document classification. We are the first to demonstrate the success of BERT on this task, achieving state of the art across four popular datasets.




1 Introduction

Until recently, the dominant paradigm in approaching natural language processing (NLP) tasks has been to concentrate on neural architecture design, using only task-specific data and shallow pre-trained word embeddings, such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013). Numerous literature surveys detail this historical neural progress: Young et al. (2018) describe a series of increasingly intricate neural NLP approaches, all of which follow the classical recipe of training on word embeddings of task-specific data. In their targeted review of sentence-pair modeling, Lan and Xu (2018) likewise examine neural networks that abide by this paradigm.

The NLP community is, however, witnessing a dramatic paradigm shift toward the pre-trained deep language representation model, which achieves state of the art in question answering, sentiment classification, and similarity modeling, to name a few. Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018) represents one of the latest developments in this line of work. It outperforms its predecessors, ELMo (Peters et al., 2018) and GPT (Radford et al., 2018), exceeding state of the art by a wide margin on multiple natural language understanding tasks.

The approach consists of two stages: first, BERT is pre-trained on vast amounts of text, with the unsupervised objectives of masked language modeling and next-sentence prediction. Second, this pre-trained network is fine-tuned on task-specific, labeled data. Crucially, pre-trained weights for BERT are already provided; all that is required is to fine-tune them on a specific task and dataset, a relatively quick process.

BERT, however, has not yet been fine-tuned for document classification. Why is this worth exploring? For one, modeling syntactic structure is arguably less important for document classification than for BERT’s tasks, such as natural language inference and paraphrasing. This claim is supported by our observation that logistic regression and support vector machines are exceptionally strong document classification baselines. For another, documents often have several labels across many classes, which is again uncharacteristic of the tasks that BERT examines.

Thus, in this paper, we explore fine-tuning BERT for document classification. Our key contribution is that we are the first to demonstrate the success of BERT on document classification tasks and datasets. We establish state-of-the-art results on four popular datasets for this task.

2 Background and Related Work

Over the last few years, neural network-based architectures have achieved state of the art for document classification. Liu et al. (2017) develop XML-CNN for addressing this problem’s multi-label nature, which they call extreme classification. XML-CNN is based on the popular KimCNN (Kim, 2014), except with wider convolutional filters, adaptive dynamic max-pooling (Chen et al., 2015; Johnson and Zhang, 2015), and an additional bottleneck layer to better capture the features of large documents. Another popular model, the Hierarchical Attention Network (HAN; Yang et al., 2016), explicitly models hierarchical information from documents to extract meaningful features, incorporating word- and sentence-level encoders (with attention) to classify documents.

Yang et al. (2018) propose a generative approach for multi-label document classification, using sequence generation models (SGMs) with an encoder–decoder architecture to generate the labels for each document. Contrary to the previous papers, Adhikari et al. (2019) propose LSTM-reg, a simple, properly regularized single-layer BiLSTM, which represents the current state of the art.

These task-specific neural architectures have dominated the NLP literature—until recently. Enabled by more computational resources and data, the deep language representation model has greatly improved state of the art on a variety of tasks. Under this paradigm, a neural network is first pre-trained on vast amounts of text under an unsupervised objective (e.g., masked language modeling and next-sentence prediction), and then fine-tuned on task-specific data. The resulting models achieve state of the art in question answering, named-entity recognition, and natural language inference, to name a few. Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018) currently represents state of the art, vastly outperforming previous models, such as the Generative Pretrained Transformer (GPT; Radford et al., 2018) and Embeddings from Language Models (ELMo; Peters et al., 2018).

3 Model

Figure 1: Document classification model formed by incorporating BERT with one additional output layer. Figure adapted from Devlin et al. (2018).

We begin with the pre-trained BERT-base and BERT-large models, which respectively represent the normal and large model variants (Devlin et al., 2018). To adapt BERT for document classification, we follow Devlin et al. (2018) and introduce a fully-connected layer over the final hidden state corresponding to the [CLS] input token, as shown in Figure 1. During fine-tuning, we optimize the entire model end-to-end, with the additional softmax classifier parameters W ∈ R^(K×H), where H is the dimension of the hidden state vectors and K is the number of classes. We minimize the cross-entropy and binary cross-entropy loss for single-label and multi-label tasks, respectively.
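The two loss functions above can be sketched numerically. In this minimal NumPy illustration, the random W and h_cls merely stand in for trained classifier parameters and an actual [CLS] hidden state, and K = 4 is an arbitrary class count:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def single_label_loss(logits, label):
    # cross-entropy over a softmax, used for single-label datasets (IMDB, Yelp)
    return -np.log(softmax(logits)[label])

def multi_label_loss(logits, targets):
    # binary cross-entropy with per-class sigmoids, used for
    # multi-label datasets (Reuters, AAPD); targets is a 0/1 vector
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).sum()

# classifier head: logits = W @ h_cls, with W in R^(K x H)
H, K = 768, 4                        # H = hidden size of BERT-base; K arbitrary
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(K, H))   # stand-in for learned parameters
h_cls = rng.normal(size=H)                # stand-in for the [CLS] hidden state
logits = W @ h_cls
```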

Dataset     C    N          W      S
Reuters     90   10,789     144.3  6.6
AAPD        54   55,840     167.3  1.0
IMDB        10   135,669    393.8  14.4
Yelp 2014   5    1,125,386  148.8  9.1
Table 1: Summary of the datasets. C denotes the number of classes in the dataset, N the number of samples, and W and S the average number of words and sentences per document, respectively.
#   Model           Reuters          AAPD             IMDB                 Yelp ’14
                    Val. F1 Test F1  Val. F1 Test F1  Val. Acc. Test Acc.  Val. Acc. Test Acc.
1   LR              77.0    74.8     67.1    64.9     43.1      43.4       61.1      60.9
2   SVM             89.1    86.1     71.1    69.1     42.5      42.4       59.7      59.6
3   KimCNN Repl.    83.5    80.8     54.5    51.4     42.9      42.7       66.5      66.1
4   KimCNN Orig.    –       –        –       –        –         37.6⁸      –         61.0⁸
5   XML-CNN Repl.   88.8    86.2     70.2    68.7     –         –          –         –
6   HAN Repl.       87.6    85.2     70.2    68.0     51.8      51.2       68.2      67.9
7   HAN Orig.       –       –        –       –        –         49.4³      –         70.5³
8   SGM Orig.       82.5    78.8     –       71.0²    –         –          –         –
9   LSTM-base       87.6    84.9     72.1    69.6     52.5      52.1       68.6      68.4
10  LSTM-reg        89.1    87.0     73.1    70.5     53.4      52.8       69.0      68.7
11  BERT-base       90.5    89.0     75.3    73.4     54.4      54.2       72.1      72.0
12  BERT-large      92.3    90.7     76.6    75.2     56.0      55.6       72.6      72.5
Table 2: Results for each model on the validation and test sets; the best value in each column belongs to BERT-large (row 12). Repl. reports the mean of five runs from our reimplementations; Orig. refers to point estimates from ²Yang et al. (2018), ³Yang et al. (2016), and ⁸Tang et al. (2015).

4 Experimental Setup

Using Hedwig, a deep learning toolkit with pre-implemented document classification models, we compare the fine-tuned BERT models against HAN, KimCNN, XML-CNN, SGM, and regularized BiLSTMs. For simple yet competitive baselines, we run the default logistic regression (LR) and support vector machine (SVM) implementations from Scikit-learn (Pedregosa et al., 2011), trained on the tf–idf vectors of the documents. We use Nvidia Tesla V100 and P100 GPUs for fine-tuning BERT, while we run the rest of the experiments on RTX 2080 Ti and GTX 1080 GPUs.
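A minimal sketch of these baselines with Scikit-learn defaults, assuming tf–idf features as described above; the four-document corpus and its labels are purely illustrative stand-ins for a dataset such as Reuters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (the real datasets are listed in Table 1)
docs = ["grain exports rose sharply this quarter",
        "crude oil prices fell on weak demand",
        "wheat and grain harvest estimates were raised",
        "oil supply from the gulf held steady"]
labels = ["grain", "oil", "grain", "oil"]

# Default LR and SVM classifiers trained on tf-idf vectors of the documents
lr = make_pipeline(TfidfVectorizer(), LogisticRegression())
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
lr.fit(docs, labels)
svm.fit(docs, labels)

pred = lr.predict(["grain harvest report"])
```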

4.1 Datasets

We use the following four datasets to evaluate BERT: Reuters-21578 (Reuters; Apté et al., 1994), IMDB reviews, the arXiv Academic Paper dataset (AAPD; Yang et al., 2018), and Yelp 2014 reviews. Reuters and AAPD are multi-label datasets, while IMDB and Yelp are single-label. Table 1 summarizes the statistics of the datasets.

For AAPD, we use the splits provided by Yang et al. (2018); for Reuters, we use the standard ModApté splits Apté et al. (1994). For IMDB and Yelp, following Yang et al. (2016), we randomly sample 80% of the data for training and 10% each for validation and test.
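The 80/10/10 split for IMDB and Yelp can be sketched as follows; the seed value is an arbitrary choice for reproducibility, not one taken from the paper:

```python
import random

def split_80_10_10(examples, seed=1234):
    # Randomly sample 80% for training and 10% each for validation
    # and test, following the protocol of Yang et al. (2016).
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_80_10_10(list(range(1000)))
```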

4.2 Training and Hyperparameters

We tune the number of epochs, batch size, learning rate, and maximum sequence length (MSL), the number of tokens that documents are truncated to. We observe that model quality is quite sensitive to the number of epochs, which must therefore be tailored to each dataset. We fine-tune on Reuters, AAPD, and IMDB for 30, 20, and 4 epochs, respectively. Due to resource constraints, we fine-tune on Yelp for only one epoch. As is the case with Devlin et al. (2018), we find that a batch size of 16, a learning rate of 2 × 10⁻⁵, and an MSL of 512 tokens yield optimal performance on the validation sets of all datasets.
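Truncation to the MSL can be sketched as below. The token ids 101 and 102 are those conventionally used for [CLS] and [SEP] in BERT's English WordPiece vocabulary, though the exact ids depend on the vocabulary in use:

```python
def truncate_to_msl(token_ids, msl=512, cls_id=101, sep_id=102):
    # Keep at most msl - 2 document tokens, leaving room for the
    # [CLS] and [SEP] special tokens within the msl-token budget.
    body = token_ids[: msl - 2]
    return [cls_id] + body + [sep_id]
```

For a Reuters document averaging 144 words, this leaves the text untouched, while a long IMDB review loses everything past the first 510 tokens.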

Hyperparameter study.

To gauge the improvement over the default hyperparameters, as well as to highlight the differences in fine-tuning BERT for document classification, we explore varying several key hyperparameters: namely, the number of epochs and the MSL. Originally, Devlin et al. (2018) find that fine-tuning for three or four epochs works well for small and large datasets alike. They also apply a generous MSL of 512, which may be unnecessary for document classification, where fewer tokens may suffice to determine the topic.

Furthermore, while conducting our experiments, we find that even fine-tuning BERT is a computationally intensive task. We argue that it is important to study these two hyperparameters, as they are major determinants of the computational resources required to fine-tune BERT. BERT-large, for example, requires eight V100s to fine-tune on our datasets, which is of course prohibitive. The number of epochs determines the duration of fine-tuning, while the maximum sequence length dictates the models’ memory and computational footprint during both fine-tuning and inference.

Thus, we vary the number of epochs and the MSL on a few selected datasets. We choose Reuters and AAPD for analyzing the choice of epochs, since they use a non-standard 30 and 20 epochs, respectively. We select Reuters and IMDB for varying the MSL, since these two datasets differ greatly in average document length: Reuters documents average 144 words, while IMDB averages 394.

Figure 2: From left to right, results on the validation set from varying the MSL and the number of epochs.

5 Results and Discussion

We report the mean F1 scores and accuracy across five runs for the reimplementations of KimCNN, XML-CNN, HAN, and the LSTM models in Table 2. Due to resource limitations, we report the scores from only a single run for BERT-base and BERT-large. We also copy the F1 value from Yang et al. (2018) for SGM on AAPD in Table 2, row 8, as we failed to replicate the authors’ results using their codebase, environment, and data splits.

5.1 Model Quality

Consistent with Devlin et al. (2018), BERT-large achieves state-of-the-art results on all four datasets, followed by BERT-base (see Table 2, rows 11 and 12). The considerably simpler LSTM-reg model (row 10) achieves a high F1 and accuracy of 87.0 and 52.8, respectively, coming close to the quality of BERT-base.

Surprisingly, the LR and SVM baselines yield competitive results on the multi-label datasets. For instance, the SVM approaches BERT on Reuters, with an F1 score of 86.1, astonishingly exceeding most of our neural baselines (rows 3–10). This can also be observed on AAPD, where the SVM surpasses most of the neural models, except SGM, the LSTM models, and BERT. However, on the single-label datasets, both LR and SVM perform worse than even our simplest neural baselines, such as KimCNN. It is worth noting that LR and SVM take only a fraction of the time and resources required to train our neural models.

5.2 Hyperparameter Analysis

MSL analysis. A decrease in the MSL corresponds to only a minor loss in F1 on Reuters (see the leftmost chart in Figure 2), possibly because Reuters documents are short. On IMDB (second subplot from the left), lowering the MSL corresponds to a drastic fall in accuracy, suggesting that the entire document is necessary for this dataset.

On the one hand, these results appear obvious. On the other hand, one can argue that, since IMDB contains longer documents, truncating tokens should hurt less. Figure 2 shows that this is not the case: truncating to even 256 tokens causes accuracy to fall below that of the much smaller LSTM-reg (see Table 2). From these results, we conclude that any amount of truncation is detrimental in document classification, though the level of degradation may differ.

Epoch analysis. The rightmost two subplots in Figure 2 illustrate the validation F1 of BERT fine-tuned for varying numbers of epochs on AAPD and Reuters. Contrary to Devlin et al. (2018), who achieve state of the art on small datasets with only a few epochs of fine-tuning, we find that smaller datasets require many more epochs to converge. On both datasets (see Figure 2), we see a significant drop in model quality when the models are fine-tuned for only four epochs, as suggested in the original paper. On Reuters, using four epochs results in an F1 worse than even that of logistic regression (Table 2, row 1).

6 Conclusion and Future Work

We demonstrate that BERT can be fine-tuned successfully to achieve state of the art across four popular datasets in document classification. We describe and explore a few nuances of this task, analyzing the effects of the maximum sequence length and the number of epochs.

One direction of future work involves compressing BERT to both fine-tune and run tractably under resource-constrained scenarios. Currently, fine-tuning the large variant is prohibitively expensive, while inference is still too heavyweight for practical deployment.


This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. This research was enabled in part by resources provided by Compute Ontario and Compute Canada.


  • Adhikari et al. (2019) Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Rethinking complex neural network architectures for document classification.
  • Apté et al. (1994) Chidanand Apté, Fred Damerau, and Sholom M. Weiss. 1994. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3):233–251.
  • Chen et al. (2015) Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 167–176.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  • Johnson and Zhang (2015) Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.
  • Lan and Xu (2018) Wuwei Lan and Wei Xu. 2018. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3890–3902.
  • Liu et al. (2017) Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Tang et al. (2015) Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432.
  • Yang et al. (2018) Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence generation model for multi-label classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
  • Young et al. (2018) Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75.