Sentiment analysis is a well-studied task in natural language processing and information retrieval (Sadegh et al., 2012; Hussein, 2018). In the past few years, researchers have made significant progress with models based on deep learning techniques (Kim, 2014; Lai et al., 2015; Chen et al., 2017; Lin et al., 2017). However, while there has been significant progress in sentiment analysis for English, much less effort has been invested in Japanese, owing to the scarcity of Japanese resources and the large labeled datasets that deep learning requires. We make use of recent transfer learning models, namely ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), and BERT (Devlin et al., 2018), to pre-train a language model that can then be fine-tuned for downstream tasks. We evaluate the models on binary and multi-class classification.
The training process involves three stages, as illustrated in Figure 1. The basic idea is similar to how fine-tuning on ImageNet (Deng et al., 2009) helps many computer vision tasks (Huh et al., 2016). However, our approach does not require labeled data for pre-training. Instead, we pre-train a language model in an unsupervised manner and then fine-tune it on a domain-specific dataset so that it can classify efficiently with much less labeled data. This is highly desirable in practice, where large labeled datasets are scarce.
The following are the primary contributions of this paper:
We experiment with ELMo, ULMFiT, and BERT on Japanese binary and 5-class sentiment classification datasets.
We conduct several ablation studies that shed light on the effectiveness of transfer learning for Japanese sentiment analysis.
We release our pre-trained models and code.
3 Related Work
Here we briefly review the popular neural embeddings and classification model architectures.
3.1 Word Embeddings
3.2 Contextualized Word Embeddings
Instead of fixed vector embeddings, CoVe (McCann et al., 2017) uses a machine translation model to embed each word within the context of its sentence. The model consists of a bidirectional LSTM encoder and a unidirectional LSTM decoder with attention, and only the encoder is used in downstream task-specific models. However, pre-training is limited by the availability of parallel corpora (e.g., English-French).
ELMo, short for Embeddings from Language Models (Peters et al., 2018), overcomes this issue by taking advantage of large monolingual corpora in an unsupervised way. The core of ELMo is a bidirectional language model, which learns to predict the probability of a target word given a sentence by combining forward and backward language models. ELMo still requires task-specific models for downstream tasks.
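For concreteness, the bidirectional language model jointly maximizes the log-likelihood of the forward and backward directions. In the notation of Peters et al. (2018), with token sequence $(t_1, \dots, t_N)$, token-representation parameters $\Theta_x$, and softmax parameters $\Theta_s$:

\[
\sum_{k=1}^{N} \Big[ \log p\big(t_k \mid t_1, \dots, t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s\big) \;+\; \log p\big(t_k \mid t_{k+1}, \dots, t_N;\ \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s\big) \Big]
\]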
Howard and Ruder (2018) proposed ULMFiT, a 3-layer LSTM-based single-model architecture that can be used for both pre-training and task-specific fine-tuning. They introduce novel techniques such as discriminative fine-tuning and slanted triangular learning rates for stable fine-tuning. OpenAI extended the idea by introducing GPT, a multi-layer transformer decoder (Radford et al., 2018). While ELMo uses a shallow concatenation of forward and backward language models, ULMFiT and OpenAI GPT are unidirectional.
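To illustrate the slanted triangular schedule mentioned above, the sketch below implements the formula from Howard and Ruder (2018): a short linear warm-up to the peak learning rate followed by a long linear decay. The default values of cut_frac, ratio, and lr_max are the ones reported in that paper; the function itself is our own illustration rather than code from any released implementation.

```python
import math

def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Learning rate at training step t out of T total steps, following the
    slanted triangular schedule of Howard and Ruder (2018)."""
    cut = math.floor(T * cut_frac)          # step at which the rate peaks
    if t < cut:
        p = t / cut                          # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: schedule over 1000 steps; peaks at step 100, then decays.
lrs = [slanted_triangular_lr(t, T=1000) for t in range(1000)]
```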
Devlin et al. (2018) argue that this limits the power of pre-trained representations by not incorporating bidirectional context, which is crucial for word-level tasks such as question answering. They proposed BERT, a multi-layer transformer encoder-based model trained on masked language modeling (MLM) and next sentence prediction (NSP) tasks. MLM enables bidirectional training by randomly masking 15% of the words in each sentence and predicting them, and NSP helps tasks such as question answering by predicting whether one sentence follows another.
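As a concrete illustration of the MLM corruption step, the sketch below selects 15% of tokens as prediction targets; the 80/10/10 split among [MASK], random-token, and unchanged replacements follows the original BERT paper. The toy vocabulary and function are our own illustration, not the authors' preprocessing code.

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Randomly select 15% of tokens as prediction targets. Of the selected
    tokens, 80% are replaced with [MASK], 10% with a random token, and 10%
    are left unchanged (as in Devlin et al., 2018). Returns the corrupted
    sequence and the target labels (-1 = not predicted)."""
    corrupted, labels = list(tokens), [-1] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = vocab.index(token)           # target for the MLM loss
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token            # 80%: mask
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, labels

vocab = ["この", "商品", "は", "とても", "良い", "です"]
print(mask_for_mlm(["この", "商品", "は", "とても", "良い", "です"], vocab))
```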
3.3 Text Classification
Many models have been proposed for English text classification, including KimCNN (Kim, 2014), LSTM (Chen et al., 2017), attention-based models (Chen et al., 2017), RCNN (Lai et al., 2015), and others. However, much less has been done for Japanese. To the best of our knowledge, the current state of the art for Japanese text classification uses shallow (context-free) word embeddings (Peinan and Mamoru, 2015; Nio and Murakami, 2018). Sun et al. (2018) proposed the Super Characters method, which converts sentence classification into image classification by projecting text into images.
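For reference, below is a minimal PyTorch sketch of a KimCNN-style classifier of the kind used as a baseline later: an embedding layer, parallel convolutions over token windows, max-over-time pooling, and a linear output layer. The hyperparameters (embedding size, filter widths) are illustrative defaults, not the exact settings of our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    """CNN for sentence classification in the style of Kim (2014)."""
    def __init__(self, vocab_size, num_classes, emb_dim=128,
                 num_filters=100, filter_sizes=(3, 4, 5), dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One 1-D convolution per window size, applied along the token axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in filter_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb, seq)
        # Max-over-time pooling of each feature map.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(features)                         # (batch, num_classes)

# Example: binary sentiment classifier over a 32,000-token vocabulary.
model = KimCNN(vocab_size=32000, num_classes=2)
logits = model(torch.randint(0, 32000, (4, 50)))  # batch of 4 sequences
```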
Zhang and LeCun (2017) conducted an extensive study of different ways of encoding Chinese, Japanese, and Korean (CJK) and English, covering 14 datasets and 473 combinations of encodings (including one-hot, character glyphs, and embeddings) and models (linear, fastText, and CNN).
Our work is based on the Japanese Rakuten product review binary and 5-class datasets provided in Zhang and LeCun (2017) and a Yahoo movie review dataset (https://github.com/dennybritz/sentiment-analysis). Table 1 provides a summary. The Rakuten datasets are used for comparison purposes, while the Yahoo dataset is used for ablation studies due to its smaller size. For the Rakuten datasets, 80% is used for training, 20% for validation, and the test set is taken from Zhang and LeCun (2017); for the Yahoo dataset, 60% is used for training, 20% for validation, and 20% for testing. We used the Japanese Wikipedia (https://dumps.wikimedia.org/) for pre-training the language models.
5.1 Pre-Training Language Model
Pre-training a language model is the most computationally expensive part. We used one NVIDIA Quadro GV100 for ULMFiT and four NVIDIA Tesla V100s for ELMo. We used WikiExtractor (https://github.com/attardi/wikiextractor) for text extraction and tokenized with MeCab (http://taku910.github.io/mecab/) using the IPADIC neologism dictionary (https://github.com/neologd/mecab-ipadic-neologd). For BERT, we use the pre-trained model by Kikuta (2019), which used the unsupervised text tokenizer SentencePiece (Kudo and Richardson, 2018) for tokenization. We did not use the BERT multilingual model (https://github.com/google-research/bert/blob/master/multilingual.md) due to its inadequate handling of Japanese, e.g., okurigana (see https://github.com/google-research/bert/issues/133 and https://github.com/google-research/bert/issues/130). The models are trained on the most frequent 32,000 tokens.
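As an illustration of this preprocessing, the sketch below tokenizes the extracted Wikipedia text with MeCab (via the mecab-python3 bindings) and, for the subword route used by the BERT model, trains a 32,000-piece SentencePiece model. The file names and the neologd dictionary location are placeholders that depend on the local installation.

```python
import MeCab
import sentencepiece as spm

# Word-level tokenization for ELMo/ULMFiT: MeCab with the neologd dictionary.
# The dictionary path is a placeholder for wherever mecab-ipadic-neologd lives.
tagger = MeCab.Tagger("-Owakati -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
with open("wiki_extracted.txt") as src, open("wiki_tokenized.txt", "w") as dst:
    for line in src:
        dst.write(tagger.parse(line.strip()))   # space-separated tokens

# Subword tokenization: a 32,000-piece SentencePiece model trained on raw text.
spm.SentencePieceTrainer.Train(
    "--input=wiki_extracted.txt --model_prefix=ja_wiki_sp "
    "--vocab_size=32000 --character_coverage=0.9995")
```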
We use the biattentive classification network (BCN) from McCann et al. (2017) with ELMo, as this combination is known to be state of the art (https://nlpprogress.com/english/sentiment_analysis.html) on the SST datasets (Socher et al., 2013). For fine-tuning the ELMo, ULMFiT, and BERT classifiers on a target labeled dataset, we follow the same hyperparameters used in the original implementations (https://github.com/fastai/fastai, https://github.com/allenai/allennlp, https://github.com/google-research/bert#fine-tuning-with-bert). We also fine-tune the language model on the target domain corpus before fine-tuning the classifier (see Section 7).
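For ULMFiT, the classifier fine-tuning stage roughly follows the fastai v1 workflow. The sketch below is illustrative only: the CSV path, the column names, and the saved vocabulary (ja_wiki_itos.pkl) and encoder (ja_wiki_enc) from Wikipedia pre-training are placeholders, and the exact function signatures may differ across fastai versions.

```python
import pickle
import pandas as pd
from fastai.text import (TextClasDataBunch, Vocab, AWD_LSTM,
                         text_classifier_learner)

# Placeholder inputs: labeled reviews plus the vocabulary saved during
# language-model pre-training (the classifier must reuse that vocabulary).
df = pd.read_csv("rakuten_reviews.csv")          # columns: label, text
train_df, valid_df = df[:-2000], df[-2000:]
wiki_vocab = Vocab(pickle.load(open("ja_wiki_itos.pkl", "rb")))

data_clas = TextClasDataBunch.from_df(
    ".", train_df=train_df, valid_df=valid_df,
    text_cols="text", label_cols="label", vocab=wiki_vocab)

learn = text_classifier_learner(data_clas, AWD_LSTM,
                                pretrained=False, drop_mult=0.5)
learn.load_encoder("ja_wiki_enc")   # encoder from the pre-trained Japanese LM
learn.fit_one_cycle(1, 2e-2)        # train the new classification head first
learn.unfreeze()                    # then fine-tune all layers
learn.fit_one_cycle(3, slice(1e-3, 1e-2))
```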
In this section, we compare the fine-tuning results of ELMo+BCN, ULMFiT, and BERT with models reported in Zhang and LeCun (2017) and other previous state-of-the-art models.
6.1 Rakuten Datasets
We trained ELMo+BCN and ULMFiT on the Rakuten datasets for several epochs each (to avoid overfitting, we train for at most 10 epochs) and selected the model that performed best. Since BERT fine-tunes all of its layers, we train it for only 3 epochs, as suggested by Devlin et al. (2018). Results are presented in Table 2. All transfer learning-based methods outperform previous methods on both datasets, showing that these methods work well even without being fine-tuned on the target corpora. (Results with these models fine-tuned on target corpora are included in Section 7.)
Table 2: Results of each model on the Rakuten Binary and Rakuten Full datasets.
6.2 Yahoo Movie Review Dataset
The Yahoo dataset is approximately 112 times smaller than the Rakuten binary dataset. We believe that this dataset better represents real-life, practical situations. To establish baselines, we trained a simple one-layer RNN and an LSTM with one linear layer on top for classification, as well as convolutional, self-attention, and hybrid state-of-the-art models for comparison. Results are shown in Table 3. BERT achieved slightly better performance than fine-tuned ULMFiT here, although it performed worse on the Rakuten binary dataset. We investigate this further in Section 7.
7 Ablation Study
7.1 Domain Adaptation
We fine-tune each source language model on the target corpus (without labels) for a few iterations before fine-tuning each classifier. ULMFiT is expected to perform better, since it is designed specifically for this use case. The results in Table 4 show that fine-tuning ULMFiT improves performance on all datasets, while ELMo and BERT show mixed results.
| Model | Rakuten Binary | Rakuten Full |
|---|---|---|
| BERT [10K steps] | 4.94 | 40.52 |
| BERT [50K steps] | 5.52 | 40.57 |
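As a concrete illustration of this domain-adaptation step for ULMFiT, the sketch below fine-tunes the Wikipedia-pre-trained language model on the unlabeled target reviews before the classifier stage. The file names (ja_wiki_lm, ja_wiki_itos) are placeholders for the pre-trained weights and vocabulary, and the fastai v1 calls shown may differ in other versions.

```python
import pandas as pd
from fastai.text import TextLMDataBunch, AWD_LSTM, language_model_learner

# Unlabeled target-domain text (the labels are ignored at this stage).
df = pd.read_csv("rakuten_reviews.csv")
data_lm = TextLMDataBunch.from_df(".", train_df=df[:-2000], valid_df=df[-2000:],
                                  text_cols="text")

# Load the Japanese Wikipedia language model, fine-tune it for a few epochs
# on the review text, then save the encoder for the classifier stage.
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3,
                            pretrained_fnames=["ja_wiki_lm", "ja_wiki_itos"])
lm.fit_one_cycle(2, 1e-3)
lm.save_encoder("ja_reviews_enc")   # loaded later via load_encoder()
```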
7.2 Low-Shot Learning
Low-shot learning here refers to training the classifier on only a small fraction of the labeled data, in contrast to the usual practice of using a large amount of data. We chose the Yahoo dataset for this experiment due to its small size. Experimental results in Table 5 show that, with only 35% of the total dataset, ULMFiT and BERT perform better than the task-specific models, while BCN+ELMo achieves a comparable result.
| Model | Error (%) |
|---|---|
| KimCNN (Kim, 2014) | 14.25 |
| Self-Attention (Lin et al., 2017) | 13.16 |
| RCNN (Lai et al., 2015) | 12.67 |
| ULMFiT Adapted [35%] | 10.62 |
7.3 Size of Pre-Training Corpus
We also investigate whether the size of the corpus used to pre-train the source language model affects sentiment analysis performance on the Yahoo dataset. We used the small ja.text8 corpus (100MB; https://github.com/Hironsan/ja.text8), built from the Japanese Wikipedia, to compare against the full Wikipedia (2.9GB) used in our previous experiments. BCN+ELMo performed slightly worse, ULMFiT slightly better, and BERT much worse. Thus, for effective sentiment analysis, a large corpus is required for pre-training BERT.
8 Discussion and Future Considerations
This research is a work in progress and will be regularly updated with new benchmarks and baselines. We showed that, with only 35% of the total dataset, transfer learning approaches perform better than previous state-of-the-art models. Furthermore, for sentiment analysis, ELMo and ULMFiT do not require large corpora for pre-training, but BERT does, since it is trained on MLM and NSP. Finally, domain adaptation improves the performance of ULMFiT. We believe that our ablation studies and the release of pre-trained models will be particularly useful for Japanese text classification. It is important to note that we did not perform K-fold cross-validation due to its high computational cost. In the future, we will investigate other NLP tasks such as named entity recognition (NER), question answering (QA), and aspect-based sentiment analysis (ABSA) (Pontiki et al., 2016).
| Model | Error (%) |
|---|---|
| ULMFiT Adapted [small] | 8.42 |
Our work showed the possibility of using transfer learning techniques for addressing sentiment classification for the Japanese language. We hope that our experimental results inspire future research dedicated to Japanese.
- Chen et al. (2017) Peng Chen, Zhongqian Sun, Lidong Bing, and Wei Yang. 2017. Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 452–461.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339.
- Huh et al. (2016) Minyoung Huh, Pulkit Agrawal, and Alexei A. Efros. 2016. What makes ImageNet good for transfer learning? arXiv e-prints, page arXiv:1608.08614.
- Hussein (2018) Doaa Mohey El-Din Mohamed Hussein. 2018. A survey on sentiment analysis challenges. Journal of King Saud University-Engineering Sciences, 30(4):330–338.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- Kikuta (2019) Yohei Kikuta. 2019. BERT pretrained model trained on Japanese Wikipedia articles. https://github.com/yoheikikuta/bert-japanese.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Lai et al. (2015) Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
- Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Nio and Murakami (2018) Lasguido Nio and Koji Murakami. 2018.
- Peinan and Mamoru (2015) Zhang Peinan and Komachi Mamoru. 2015. Japanese sentiment classification with stacked denoising auto-encoder using distributed word representation.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Pontiki et al. (2016) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, AL-Smadi Mohammad, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. Semeval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), pages 19–30.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf.
- Sadegh et al. (2012) Mohammad Sadegh, Roliana Ibrahim, and Zulaiha Ali Othman. 2012. Opinion mining and sentiment analysis: A survey. International Journal of Computers & Technology, 2(3):171–178.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Sun et al. (2018) Baohua Sun, Lin Yang, Patrick Dong, Wenhan Zhang, Jason Dong, and Charles Young. 2018. Super Characters: A Conversion from Sentiment Classification to Image Classification. arXiv e-prints, page arXiv:1810.07653.
- Zhang and LeCun (2017) Xiang Zhang and Yann LeCun. 2017. Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean? arXiv e-prints, page arXiv:1708.02657.