An Investigation of Transfer Learning-Based Sentiment Analysis in Japanese

by Enkhbold Bataa, et al.

Text classification approaches have usually required task-specific model architectures and huge labeled datasets. Recently, thanks to the rise of text-based transfer learning techniques, it has become possible to pre-train a language model in an unsupervised manner and leverage it to perform effectively on downstream tasks. In this work we focus on Japanese and show the potential of transfer learning techniques for text classification. Specifically, we perform binary and multi-class sentiment classification on the Rakuten product review and Yahoo movie review datasets. We show that transfer learning-based approaches outperform task-specific models trained on 3 times as much data, and that they perform just as well even when the language model is pre-trained on only 1/30 of the data. We release our pre-trained models and code as open source.




1 Introduction

Sentiment analysis is a well-studied task in the field of natural language processing and information retrieval Sadegh et al. (2012); Hussein (2018). In the past few years, researchers have made significant progress with models that make use of deep learning techniques Kim (2014); Lai et al. (2015); Chen et al. (2017); Lin et al. (2017). However, while there has been significant progress in sentiment analysis for English, far less effort has been invested in Japanese, owing to the scarcity of labeled Japanese data and the large datasets that deep learning requires. We make use of recent transfer learning models such as ELMo Peters et al. (2018), ULMFiT Howard and Ruder (2018), and BERT Devlin et al. (2018) to each pre-train a language model which can then be used to perform downstream tasks. We test the models on binary and multi-class classification.

Figure 1:

Transfer learning-based text classification. First, we train the LM on a large corpus. Then, we fine-tune it on a target corpus. Finally, we train the classifier using labeled examples.

The training process involves three stages, as illustrated in Figure 1. The basic idea is similar to how fine-tuning on ImageNet Deng et al. (2009) helps many computer vision tasks Huh et al. (2016). However, our approach does not require labeled data for pre-training. Instead, we pre-train a language model in an unsupervised manner and then fine-tune it on a domain-specific dataset, allowing it to classify efficiently using much less labeled data. This is highly desirable since large labeled datasets are scarce in practice.
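The three stages above can be sketched as the following skeleton. This is an illustrative outline only; the function names and return values are hypothetical and are not taken from our released code.

```python
# Hypothetical sketch of the three-stage transfer learning pipeline.

def pretrain_lm(general_corpus):
    # Stage 1: unsupervised language-model pre-training on a large corpus.
    return {"stage": "pretrained", "corpus": general_corpus}

def finetune_lm(lm, target_corpus):
    # Stage 2: adapt the LM to the target domain, still without labels.
    return {**lm, "stage": "domain-adapted", "target": target_corpus}

def train_classifier(lm, labeled_examples):
    # Stage 3: train a classifier head on top of the adapted LM,
    # using the (comparatively small) labeled dataset.
    return {"backbone": lm["stage"], "n_labeled": len(labeled_examples)}

lm = pretrain_lm("wikipedia-ja")
lm = finetune_lm(lm, "product-reviews")
clf = train_classifier(lm, [("良い商品です", 1), ("最悪でした", 0)])
```

Only the third stage consumes labels, which is why the approach remains useful when labeled data is scarce.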

2 Contributions

The following are the primary contributions of this paper:

  • We experiment with ELMo, ULMFiT, and BERT on Japanese datasets, including binary and 5-class datasets.

  • We do several ablation studies that are helpful for understanding the effectiveness of transfer learning in Japanese sentiment analysis.

  • We release our pre-trained models and code.

3 Related Work

Here we briefly review the popular neural embeddings and classification model architectures.

3.1 Word Embeddings

Word embedding is defined as the representation of a word as a dense vector. There have been many neural network implementations, including word2vec Mikolov et al. (2013), fastText Joulin et al. (2016), and GloVe Pennington et al. (2014), that embed words using a single layer and achieve state-of-the-art performance in various NLP tasks. However, these embeddings are not context-specific: in the phrases "I washed my dish" and "I ate my dish", the word "dish" refers to different things but is still represented by the same embedding.
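To make this limitation concrete, here is a minimal, hypothetical sketch of a context-free embedding lookup (random vectors stand in for trained ones): the table assigns one fixed vector per word, so the two occurrences of "dish" are indistinguishable.

```python
import random

random.seed(0)
vocab = ["i", "washed", "ate", "my", "dish"]
# One fixed 4-dimensional vector per word, regardless of context.
embeddings = {w: tuple(random.random() for _ in range(4)) for w in vocab}

def embed(sentence):
    # Context-free lookup: each token maps to the same vector everywhere.
    return [embeddings[w] for w in sentence.lower().split()]

v_washed = embed("I washed my dish")[-1]
v_ate = embed("I ate my dish")[-1]
assert v_washed == v_ate  # same vector for both senses of "dish"
```

Contextualized models, described next, instead compute a vector that depends on the whole sentence.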

3.2 Contextualized Word Embeddings

Instead of fixed vector embeddings, CoVe McCann et al. (2017) uses a machine translation model to embed each word within the context of its sentence. The model consists of a bidirectional LSTM encoder and a unidirectional LSTM decoder with attention, and only the encoder is used for downstream task-specific models. However, pre-training is limited by the availability of parallel corpora (e.g., English-French).

ELMo, short for Embeddings from Language Models Peters et al. (2018), overcomes this issue by taking advantage of large monolingual data in an unsupervised way. The core of ELMo is a bidirectional language model that learns to predict the probability of a target word given a sentence by combining forward and backward language models. ELMo still requires task-specific models for downstream tasks.

Howard and Ruder (2018) proposed a 3-layer LSTM-based single-model architecture, ULMFiT, that can be used for both pre-training and task-specific fine-tuning. They use novel techniques such as discriminative fine-tuning and slanted triangular learning rates for stable fine-tuning. OpenAI extended the idea by introducing GPT, a multi-layer transformer decoder Radford et al. (2018). While ELMo uses a shallow concatenation of forward and backward language models, ULMFiT and OpenAI GPT are unidirectional.
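The slanted triangular learning rate schedule can be written down directly. The sketch below follows the formulation in Howard and Ruder (2018), using the default hyperparameters reported there (cut_frac=0.1, ratio=32); it is an illustration, not an excerpt from our training code.

```python
import math

def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular schedule: a short linear warm-up over the first
    cut_frac of training, followed by a long linear decay. The lowest
    learning rate is eta_max / ratio."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut                                   # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return eta_max * (1 + p * (ratio - 1)) / ratio

T = 1000
lrs = [slanted_triangular_lr(t, T) for t in range(T)]
# The peak learning rate (eta_max) is reached at the cut point, t = 100.
```

The short warm-up lets the model settle into a suitable region of parameter space before the long decay refines it, which is part of why fine-tuning stays stable.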

Devlin et al. argue that this limits the power of pre-trained representations by not incorporating bidirectional context, which is crucial for word-level tasks such as question answering. They proposed BERT, a multi-layer transformer encoder-based model trained on masked language modeling (MLM) and next sentence prediction (NSP) tasks. MLM enables bidirectional training by randomly masking 15% of the words in each sentence and predicting them, and NSP helps tasks such as question answering by predicting whether one sentence follows another.
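The masking step of MLM can be illustrated with a few lines of code. This is a simplified sketch: BERT's actual procedure also sometimes keeps the selected token or substitutes a random one rather than always using the mask symbol.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Replace roughly mask_prob of the tokens with [MASK] and record the
    originals; the model is trained to recover them. (Simplified: real
    BERT uses an 80/10/10 mask/keep/random-replace split.)"""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "the movie was surprisingly good and well acted".split()
masked, targets = mask_tokens(tokens)
```

Because the model sees unmasked context on both sides of each masked position, the objective is inherently bidirectional, unlike a left-to-right language model.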

3.3 Text Classification

Many models have been proposed for English text classification, including KimCNN Kim (2014), LSTM Chen et al. (2017), Self-Attention Lin et al. (2017), RCNN Lai et al. (2015), etc. However, much less has been done for Japanese. To the best of our knowledge, the current state of the art for Japanese text classification uses shallow (context-free) word embeddings Peinan and Mamoru (2015); Nio and Murakami (2018). Sun et al. (2018) proposed the Super Characters method, which converts sentence classification into image classification by projecting text into images.

Zhang and LeCun (2017) did an extensive study of different ways of encoding Chinese/Japanese/Korean (CJK) and English languages, covering 14 datasets and 473 combinations of encodings (including one-hot, character glyphs, and embeddings) and models (linear, fastText, and CNN).

This paper investigates transfer learning-based methods for Japanese sentiment analysis and compares them with the above-mentioned models, including those of Zhang and LeCun (2017) and Sun et al. (2018).

4 Dataset

Our work is based on the Japanese Rakuten product review binary and 5-class datasets provided in Zhang and LeCun (2017), and a Yahoo movie review dataset. Table 1 provides a summary. The Rakuten datasets are used for comparison purposes, while the Yahoo dataset is used for ablation studies due to its smaller size. For the Rakuten datasets, 80% is used for training and 20% for validation, and the test set is taken from Zhang and LeCun (2017); for the Yahoo dataset, 60% is used for training, 20% for validation, and 20% for testing. We used the Japanese Wikipedia for pre-training the language model.

Dataset         Classes  Train      Test
Rakuten full    5        4,000,000  500,000
Rakuten binary  2        3,400,000  400,000
Yahoo binary    2        32,073     6,109
Table 1: Datasets

5 Training

5.1 Pre-Training Language Model

Pre-training a language model is the most expensive part. We used 1 NVIDIA Quadro GV100 for ULMFiT and 4 NVIDIA Tesla V100s for ELMo. We used WikiExtractor for text extraction and tokenized the text with MeCab using the IPADIC neologism dictionary. For BERT, we use the pre-trained BERT model by Kikuta (2019), which used the unsupervised text tokenizer SentencePiece Kudo and Richardson (2018) for tokenization. We did not use the multilingual BERT model due to its incompatible treatment of Japanese (e.g., its handling of okurigana). The models are trained on the most frequent 32,000 tokens.
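Restricting the vocabulary to the most frequent tokens can be sketched as below. The helper is hypothetical (it is not our released code, which delegates this to the respective tokenizers); out-of-vocabulary tokens fall back to an unknown symbol.

```python
from collections import Counter

def build_vocab(tokenized_corpus, max_size=32000):
    """Map the most frequent tokens to integer ids; everything else
    falls back to the <unk> token at id 0."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    vocab = ["<unk>"] + [tok for tok, _ in counts.most_common(max_size - 1)]
    return {tok: i for i, tok in enumerate(vocab)}

# Tiny toy corpus of pre-tokenized sentences.
corpus = [["猫", "が", "好き"], ["犬", "が", "好き"], ["が"]]
stoi = build_vocab(corpus, max_size=4)
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["が", "鳥"]]  # 鳥 is OOV
```

Capping the vocabulary keeps the embedding and softmax layers tractable; rare words are either dropped to `<unk>` or, as with SentencePiece, split into subword units.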

5.2 Fine-Tuning

We use the biattentive classification network (BCN) from McCann et al. (2017) with ELMo, as this combination is known to achieve state-of-the-art results on the SST datasets Socher et al. (2013). For fine-tuning the ELMo, ULMFiT, and BERT classifiers on a target labeled dataset, we follow the same parameters that were used in the original implementations. We also fine-tune the LM on the target domain corpus before fine-tuning the classifier (please refer to Section 7).

6 Results

In this section, we compare the fine-tuning results of ELMo+BCN, ULMFiT, and BERT with models reported in Zhang and LeCun (2017) and other previous state-of-the-art models.

6.1 Rakuten Datasets

We trained ELMo+BCN and ULMFiT on the Rakuten datasets for several epochs each (at most 10, to avoid overfitting) and selected the model that performed best. Since BERT fine-tunes all of its layers, we only train it for 3 epochs, as suggested by Devlin et al. (2018). Results are presented in Table 2. All transfer learning-based methods outperform previous methods on both datasets, showing that these methods work well even without being fine-tuned on the target corpora. (Results with these models fine-tuned on target corpora are included in Section 7.)

Model            Rakuten Binary  Rakuten Full
GlyphNet         8.55            48.97
OnehotNet        5.93            45.1
EmbedNet         6.07            45.2
Linear Model     6.63            45.26
fastText         5.45            43.27
Super Character  5.15            42.30
BCN+ELMo         4.77            42.95
ULMFiT           4.45            41.39
BERT             4.68            40.68
Table 2: Rakuten test results, in error percentages. Best results for the other models (GlyphNet to Super Character) are taken from Zhang and LeCun (2017) and Sun et al. (2018).

6.2 Yahoo movie review dataset

The Yahoo dataset is approximately 112 times smaller than the Rakuten binary dataset. We believe that this dataset better represents practical, real-life situations. To establish baselines, we trained a simple one-layer RNN and an LSTM, each with one linear layer on top for classification, as well as convolutional, self-attention, and hybrid state-of-the-art models for comparison. Results are shown in Table 3. BERT achieved slightly better performance than fine-tuned ULMFiT here, although it performed worse on the Rakuten binary dataset. We investigate this further in Section 7.

Model                             Yahoo Binary
RNN Baseline                      35.29
LSTM Baseline                     32.41
KimCNN Kim (2014)                 14.25
Self-Attention Lin et al. (2017)  13.16
RCNN Lai et al. (2015)            12.67
BCN+ELMo                          10.24
ULMFiT                            12.20
ULMFiT Adapted                    8.52
BERT                              8.42
Table 3: Yahoo test results, in error percentages.

7 Ablation Study

7.1 Domain Adaptation

We fine-tune each source language model on the target corpus (without labels) for a few iterations before fine-tuning each classifier. ULMFiT is expected to perform better since it is designed for this use case. The results in Table 4 show that fine-tuning the LM improves ULMFiT's performance on all datasets, while ELMo and BERT show mixed results.

Model             Rakuten Binary  Rakuten Full
BCN+ELMo          4.76            43.79
ULMFiT            4.18            41.05
BERT [10K steps]  4.94            40.52
BERT [50K steps]  5.52            40.57
Table 4: Domain-adapted results, in error percentages. ULMFiT and ELMo are trained for 5 epochs, while BERT is trained for 10K and 50K steps.

7.2 Low-Shot Learning

Low-shot learning refers to training a model on only a small amount of labeled data, in contrast to the usual practice of using a large amount. We chose the Yahoo dataset for this experiment due to its small size. Experimental results in Table 5 show that, with only 35% of the total dataset, ULMFiT and BERT perform better than the task-specific models, while BCN+ELMo achieves a comparable result.

Model                             Yahoo Binary
RNN Baseline                      35.29
LSTM Baseline                     32.41
KimCNN Kim (2014)                 14.25
Self-Attention Lin et al. (2017)  13.16
RCNN Lai et al. (2015)            12.67
BCN+ELMo [35%]                    13.51
ULMFiT Adapted [35%]              10.62
BERT [35%]                        10.14
Table 5: Low-shot learning results for the Yahoo dataset, in error percentages. Transfer learning-based methods are trained on 35% of the total dataset, while the other models are trained on the whole dataset.
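Drawing a fixed fraction of the training set per class, so that label balance is preserved, can be sketched as follows. This is illustrative only; the paper does not specify the exact sampling procedure used.

```python
import random

def stratified_subsample(examples, fraction, seed=0):
    """Sample `fraction` of the examples from each class separately,
    preserving the original label distribution."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    subset = []
    for label, items in by_label.items():
        k = max(1, int(len(items) * fraction))  # at least one per class
        subset.extend(rng.sample(items, k))
    return subset

# Toy balanced binary dataset of 1,000 labeled reviews.
data = [(f"review {i}", i % 2) for i in range(1000)]
subset = stratified_subsample(data, 0.35)  # 35% of each class
```

Stratifying matters in low-shot settings: a plain uniform sample of a small fraction can skew the label distribution and distort the comparison between models.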

7.3 Size of Pre-Training Corpus

We also investigate whether the size of the pre-training corpus affects sentiment analysis performance on the Yahoo dataset. We used the small ja.text8 corpus (100MB), drawn from the Japanese Wikipedia, for comparison with the whole Japanese Wikipedia (2.9GB) used in our previous experiments. As Table 6 shows, BCN+ELMo performed slightly worse, ULMFiT slightly better, and BERT much worse. Thus, for effective sentiment analysis, a large corpus is required for pre-training BERT.

8 Discussion and Future Considerations

This research is a work in progress and will be regularly updated with new benchmarks and baselines. We showed that with only 35% of the total dataset, transfer learning approaches perform better than previous state-of-the-art models. Furthermore, for sentiment analysis, ELMo and ULMFiT do not require large corpora for pre-training, but BERT does, since it is trained on MLM and NSP. Finally, domain adaptation improves the performance of ULMFiT. We believe that our ablation study and the release of our pre-trained models will be particularly useful for Japanese text classification. It is important to note that we did not perform K-fold cross-validation due to its high computational cost. In the future, we will investigate other NLP tasks such as named entity recognition (NER), question answering (QA), and aspect-based sentiment analysis (ABSA) Pontiki et al. (2016).

Model                   Yahoo Binary
BCN+ELMo                10.24
ULMFiT                  12.20
ULMFiT Adapted          8.52
BERT                    8.42
BCN+ELMo [small]        10.32
ULMFiT Adapted [small]  8.42
BERT [small]            14.26
Table 6: Comparison of results using large and small pre-training corpora, in error percentages. The small corpus (100MB) is uniformly sampled from the Japanese Wikipedia; the large corpus is the entire Japanese Wikipedia (2.9GB).

9 Conclusion

Our work showed the possibility of using transfer learning techniques for addressing sentiment classification for the Japanese language. We hope that our experimental results inspire future research dedicated to Japanese.