Forward and Backward Knowledge Transfer for Sentiment Classification

06/08/2019 ∙ by Hao Wang, et al. ∙ Southwest Jiaotong University ∙ University of Illinois at Chicago

This paper studies the problem of learning a sequence of sentiment classification tasks. The learned knowledge from each task is retained and used to help future or subsequent task learning. This learning paradigm is called Lifelong Learning (LL). However, existing LL methods either only transfer knowledge forward to help future learning and do not go back to improve the model of a previous task or require the training data of the previous task to retrain its model to exploit backward/reverse knowledge transfer. This paper studies reverse knowledge transfer of LL in the context of naive Bayesian (NB) classification. It aims to improve the model of a previous task by leveraging future knowledge without retraining using its training data. This is done by exploiting a key characteristic of the generative model of NB. That is, it is possible to improve the NB classifier for a task by improving its model parameters directly by using the retained knowledge from other tasks. Experimental results show that the proposed method markedly outperforms existing LL baselines.




1 Introduction

Lifelong learning (LL) aims to learn a sequence (possibly never ending) of tasks. After a task is learned, its knowledge is retained and later used to help future task learning Thrun (1998); Silver et al. (2013); Chen and Liu (2018). This paper studies LL for sentiment classification (SC). SC classifies an opinion document (e.g., a product review) as expressing a positive or negative sentiment Pang et al. (2008); Cambria et al. (2017). In the LL setting, we are interested in learning a sequence of SC tasks. However, existing LL methods mainly help future task learning by leveraging the knowledge learned from past tasks, which we call forward knowledge transfer (FKT). If the model of a past task needs to be improved, LL can also use the knowledge learned in future tasks to help, but the training data of the past task is needed for retraining.

In this paper, we not only want to achieve FKT to improve any future task learning, but also want to achieve reverse knowledge transfer (RKT) to improve the model of any past task without retraining using its training data. This enhanced LL setting is natural as we humans seem to learn in a similar way. We can use some newly learned knowledge to help improve a previous task without re-learning from the past experiences or data, which are often forgotten.

Problem Definition: At any point in time, a learner has learned $n$ tasks. Each task $T_i$ ($1 \le i \le n$) has its training data $D_i$. The learned knowledge of each task is retained in a knowledge base (KB). When faced with a new task $T_{n+1}$, the knowledge in the KB is leveraged to help learn $T_{n+1}$. After $T_{n+1}$ is learned, its knowledge is also incorporated into the KB. The knowledge in the KB can also be used to improve the model of any past task $T_i$ with no retraining using its training data $D_i$.

In this paper, we propose to deal with the above problem using naïve Bayes (NB) by exploiting its generative model parameters. The proposed method is called Lifelong NB (LNB). The key idea is that prior knowledge can be mined from the generative model parameters of previous tasks and used to directly revise the generative model parameters of the target task, which can be a new or a past task. Since the generative model parameters are computed during training for each task and retained, no retraining is needed when going back to improve the model of any past task.

In summary, this paper makes the following contributions. (1) It studies an enhanced LL setting, i.e., not only improving future task learning, but also going back to improve the past task models with no retraining using their training data. (2) It proposes a novel LL model called Lifelong Naïve Bayes (LNB) for SC in this new setting by exploiting the parameters of the generative model of NB. To the best of our knowledge, this is the first such formulation. (3) It evaluates the effectiveness of the proposed model and shows that it makes a considerable improvement over state-of-the-art baselines.

2 Related Work

The work most related to ours is lifelong learning (LL) Mitchell et al. (2015); Ruvolo and Eaton (2013); Chen et al. (2015); Fei et al. (2016); Isele et al. (2016); Xia et al. (2017); Shu et al. (2017); Xu et al. (2018); Sun et al. (2018). However, these works cannot improve the model of a past task using the knowledge learned from other tasks without retraining using the training data of the past task. Chen et al. (2015) proposed the first LL method (called LSC) for sentiment classification (SC), based on optimization that considers the past knowledge. However, it also needs retraining using the past training data to improve its model. Xia et al. (2017) presented two LL methods for SC based on voting of individual task classifiers. The first method votes with equal weight for each task classifier and can be applied to help past tasks. However, this method performs poorly. We will compare with this method experimentally. The second method uses weighted voting, which needs the past task training data for retraining.

Our work is also closely related to transfer learning (TL) Pan and Yang (2010); Blitzer et al. (2007); Andreevskaia and Bergler (2008); Bollegala et al. (2011); He et al. (2011); Li et al. (2012, 2013); Xia and Zong (2011); Pan et al. (2010); Wu et al. (2009, 2017); Augenstein et al. (2018). However, TL can only use the source domains to help the target domain, but not the reverse, as the target domain has little or no labeled training data.

3 Lifelong Naïve Bayes

3.1 Naïve Bayes for Text Classification

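Naïve Bayes (NB) text classifier is a generative model Nigam et al. (1998). Given a document $d$ with words $w_1, \ldots, w_{|d|}$, the NB classifier based on multinomial distributions is defined as:

$$P(c_j \mid d) = \frac{P(c_j) \prod_{i=1}^{|d|} P(w_i \mid c_j)}{\sum_{r=1}^{|C|} P(c_r) \prod_{i=1}^{|d|} P(w_i \mid c_r)} \qquad (1)$$

where $c_j$ is the positive (+) or negative (-) class in our case, $|C|$ is the number of classes, and the multinomial distribution parameter is

$$P(w \mid c_j) = \frac{\lambda + N_{c_j,w}}{\lambda |V| + \sum_{v=1}^{|V|} N_{c_j,v}} \qquad (2)$$

where $N_{c_j,w}$ is the number of times that word $w$ occurs in the training documents of class $c_j$, $|V|$ is the vocabulary size, and $\lambda$ is the smoothing parameter. We use $\lambda = 0.1$ as it is shown in Agrawal et al. (2000) that $\lambda < 1$ (Lidstone smoothing) is superior to $\lambda = 1$ (Laplacian smoothing).

From Eq. (2), we see that NB mainly depends on the word frequency count $N_{c_j,w}$, which is also the core knowledge that our LNB will retain for each domain in addition to the class prior $P(c_j)$.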
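As an illustration of the smoothed word-probability estimate and the posterior described in this subsection, here is a minimal Python sketch. It is not the authors' implementation: the function names, the token-list input format, and the choice to skip test words that never occurred in training are assumptions made for this example, and it assumes both classes occur in the training data.

```python
import math
from collections import Counter


def train_nb(docs, labels, lam=0.1):
    """Collect the counts a multinomial NB model needs.

    docs: list of token lists; labels: list of '+' / '-' class labels.
    Assumes both classes occur at least once in `labels`.
    """
    word_counts = {"+": Counter(), "-": Counter()}
    doc_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = set(word_counts["+"]) | set(word_counts["-"])
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "vocab": vocab, "lam": lam}


def word_prob(model, word, cls):
    """Lidstone-smoothed P(word | cls), in the form of Eq. (2)."""
    counts = model["word_counts"][cls]
    lam, vocab_size = model["lam"], len(model["vocab"])
    return (lam + counts[word]) / (lam * vocab_size + sum(counts.values()))


def posterior(model, tokens):
    """Return P(cls | tokens) for both classes, computed in log space."""
    total_docs = sum(model["doc_counts"].values())
    log_scores = {}
    for cls in ("+", "-"):
        score = math.log(model["doc_counts"][cls] / total_docs)
        for w in tokens:
            if w in model["vocab"]:  # assumption: ignore unseen words
                score += math.log(word_prob(model, w, cls))
        log_scores[cls] = score
    # Normalise with the log-sum-exp trick to avoid numerical underflow.
    top = max(log_scores.values())
    z = sum(math.exp(s - top) for s in log_scores.values())
    return {cls: math.exp(s - top) / z for cls, s in log_scores.items()}
```

Because the classifier is fully determined by the word counts and document counts per class, those counts are all that has to be kept in order to rebuild or revise it later.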

3.2 LNB for Sentiment Classification

In our LNB, there are three key components: Knowledge Miner (KM), Knowledge Base (KB), and Knowledge-Based Learner (KBL). In brief, KM mines knowledge from training data of each task/domain, KB stores the mined knowledge, and KBL abstracts some high-level knowledge from the KB and leverages it in learning the target task/domain , which can be a new domain task or a previous domain task. Following Chen et al. (2015), we treat the classification in each domain (i.e., a type of product) as a learning task. Thus, we use the terms domain and task interchangeably throughout the paper.

The key idea of LNB is to revise the multinomial distribution parameters (i.e., Eq. (2)) for the target task using prior knowledge in the KB from the previous tasks to build a better target task classifier. Below we describe how to make this idea work.

Given a task sequence , KM extracts two types of knowledge from the training data of each new task . (1) and : number of times word occurs in the training documents of the positive (+) and negative (-) class in , respectively. (2) and : number of training documents in the + and - class in , respectively. These two types of knowledge are stored in the KB after each new task is learned.
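The two kinds of counts that KM extracts and the KB stores can be sketched as follows. This is only an illustration under stated assumptions: the class name `KnowledgeBase`, the function `mine_knowledge`, and the dictionary layout are hypothetical and do not come from the paper.

```python
from collections import Counter


class KnowledgeBase:
    """Hypothetical KB: per-domain word counts and document counts per class."""

    def __init__(self):
        self.word_counts = {}  # domain -> {"+": Counter, "-": Counter}
        self.doc_counts = {}   # domain -> {"+": int, "-": int}

    def store(self, domain, word_counts, doc_counts):
        """Retain the mined counts of one domain; no raw documents are kept."""
        self.word_counts[domain] = word_counts
        self.doc_counts[domain] = doc_counts


def mine_knowledge(docs, labels):
    """Hypothetical Knowledge Miner for one domain.

    docs: list of token lists; labels: list of '+' / '-' labels.
    Returns (word counts per class, document counts per class).
    """
    word_counts = {"+": Counter(), "-": Counter()}
    doc_counts = {"+": 0, "-": 0}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    return word_counts, doc_counts
```

A call such as `kb.store("camera", *mine_knowledge(docs, labels))` (with a made-up domain name) would retain everything needed from that domain, which is why the original training data does not have to be revisited.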

Next, KBL uses the stored knowledge and the knowledge newly extracted by KM (if the target task is a new task) to compute the generative model parameters and to build an NB classifier for the target domain . Specifically, KBL first abstracts three types of high-level knowledge from the KB:

Word-level knowledge and : number of times word occurs in the training documents of the positive (+) and negative (-) class in all domains except the target domain, i.e., and .

Target domain-dependent knowledge and : ratio of word probability in positive (and negative) class vs. negative (positive) class in the target domain , i.e., and .

Domain-level knowledge and : number of non-target domains in which and , where and are estimated by using Eq. (2) in that domain, and is a parameter.
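The three kinds of high-level knowledge listed above can be sketched in Python from per-domain word counts. Several details here are assumptions rather than the authors' definitions: the result keys (`kb_pos`, `r_pos`, `m_pos`, and so on) are hypothetical names, the vocabulary size is taken per domain, and a non-target domain is counted in the domain-level knowledge when the probability ratio is at least a threshold `gamma`, which is only one plausible reading of the condition.

```python
from collections import Counter


def smoothed_prob(counts, word, vocab_size, lam=0.1):
    """Lidstone-smoothed P(word | class) from one class's word counts."""
    return (lam + counts[word]) / (lam * vocab_size + sum(counts.values()))


def domain_ratios(word_counts, word, lam=0.1):
    """Return (P(w|+)/P(w|-), P(w|-)/P(w|+)) within a single domain.

    Assumes the domain has a non-empty vocabulary.
    """
    vocab_size = len(set(word_counts["+"]) | set(word_counts["-"]))
    p_pos = smoothed_prob(word_counts["+"], word, vocab_size, lam)
    p_neg = smoothed_prob(word_counts["-"], word, vocab_size, lam)
    return p_pos / p_neg, p_neg / p_pos


def abstract_knowledge(kb, target, word, gamma, lam=0.1):
    """kb maps domain name -> {"+": Counter, "-": Counter} of word counts."""
    others = [d for d in kb if d != target]
    # Word-level knowledge: counts summed over all non-target domains.
    kb_pos = sum(kb[d]["+"][word] for d in others)
    kb_neg = sum(kb[d]["-"][word] for d in others)
    # Target domain-dependent knowledge: probability ratios in the target.
    r_pos, r_neg = domain_ratios(kb[target], word, lam)
    # Domain-level knowledge. ASSUMPTION: a domain counts when ratio >= gamma.
    m_pos = sum(domain_ratios(kb[d], word, lam)[0] >= gamma for d in others)
    m_neg = sum(domain_ratios(kb[d], word, lam)[1] >= gamma for d in others)
    return {"kb_pos": kb_pos, "kb_neg": kb_neg,
            "r_pos": r_pos, "r_neg": r_neg,
            "m_pos": m_pos, "m_neg": m_neg}
```

Note that every quantity is computed from stored counts only, so the same function applies whether the target is a new domain or a previously learned one.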

Then, KBL integrates these pieces of knowledge to revise and mined from the target domain. We denote the revised results as and . The intuition here is that if a word can distinguish classes very well in the target domain, we should rely on the target domain. So, we define a set of target domain-dependent words, denoted by . A word belongs to if or , where is a parameter. On the other hand, if a word is reliable among most non-target domains (e.g., half of the non-target domains), we should follow the knowledge associated with this word in the KB. Similarly, we define a set of domain-reliable words, denoted by . A word belongs to if or , where is the number of the non-target domains.

Improving past and future domain classification: It is clear that LNB can treat any past or future domain as the target domain and improve its classification. LNB only needs the frequency count of each word in each class of each (past or future) domain (which is stored in the KB). Thus, for a past/previous domain, no retraining using its original training data is needed. In summary, our LNB model works for a test document in the target domain as shown in Figure 1.

1:  Extract uni-gram features for words ;
2:  for each feature word  do
3:     if  belongs to  then
4:        ,
5:        ;// where 
6:     else if  belongs to  then
7:        ;
8:     else
9:        ;
10:     end if
11:  end for
12:  return  .
Figure 1: LNB classification algorithm for a test document in the target domain .

4 Experiments

Datasets: Since the main baseline is the LSC system, we experiment using the same 20-domain dataset as in Chen et al. (2015). Following Chen et al. (2015), we use two versions of the dataset with positive and negative classes in different class distributions, i.e., Natural Class Distribution and Balanced Class Distribution; see the supplementary note in the Appendix or the work of Chen et al. (2015).

We extracted uni-gram features with no feature selection from the raw reviews. Also, we followed Pang et al. (2002) to deal with negation words, as in Chen et al. (2015).

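| Metric | NB-T | NB-S | NB-ST | SVM-T | SVM-S | SVM-ST | LSC | LLV | LNB |
|---|---|---|---|---|---|---|---|---|---|
| Average F1-score of the negative class, Natural Class Distribution | 45.20 | 55.00 | 56.49 | 50.39 | 52.66 | 59.15 | 56.62 | 47.46 | 64.96 |
| Average Accuracy of the two classes, Balanced Class Distribution | 77.40 | 74.82 | 80.04 | 76.09 | 75.79 | 79.29 | 82.09 | 78.59 | 83.17 |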
Table 1: New task evaluation: Average sentiment classification (SC) performance over 20 domains.
Average F1-score of the negative class in the Natural Class Distribution:

| Sequence | NB-T | NB-S | NB-ST | SVM-T | SVM-S | SVM-ST | LSC | LLV | LNB |
|---|---|---|---|---|---|---|---|---|---|
| S1 | 43.04 | 49.66 | 54.25 | 51.16 | 50.21 | 57.34 | 54.89 | 44.25 | 63.80 |
| S2 | 44.42 | 51.16 | 51.98 | 51.80 | 51.13 | 58.62 | 53.93 | 44.32 | 63.56 |
| S3 | 42.35 | 49.37 | 52.29 | 49.99 | 46.70 | 59.60 | 51.80 | 44.76 | 63.62 |
| S4 | 42.64 | 42.22 | 45.07 | 50.60 | 45.22 | 54.90 | 50.15 | 43.87 | 63.66 |
| S5 | 42.65 | 46.16 | 51.90 | 50.05 | 49.54 | 59.35 | 48.43 | 44.46 | 63.39 |
| S6 | 42.48 | 45.56 | 52.10 | 50.50 | 49.30 | 60.16 | 53.45 | 43.83 | 63.64 |
| S7 | 45.63 | 46.28 | 52.21 | 51.44 | 50.15 | 59.29 | 52.02 | 46.78 | 65.09 |
| S8 | 43.05 | 50.74 | 51.34 | 51.16 | 48.96 | 57.52 | 52.64 | 44.25 | 63.80 |
| S9 | 43.15 | 49.61 | 50.78 | 50.72 | 48.41 | 58.08 | 54.14 | 43.15 | 63.30 |
| S10 | 42.48 | 51.23 | 52.26 | 50.50 | 48.95 | 57.11 | 51.05 | 43.83 | 63.64 |
| Ave. | 43.19 | 48.20 | 51.42 | 50.79 | 48.86 | 58.20 | 52.25 | 44.35 | 63.75 |

Average Accuracy of the two classes in the Balanced Class Distribution:

| Sequence | NB-T | NB-S | NB-ST | SVM-T | SVM-S | SVM-ST | LSC | LLV | LNB |
|---|---|---|---|---|---|---|---|---|---|
| S1 | 78.81 | 71.84 | 77.76 | 76.57 | 73.42 | 80.26 | 81.84 | 79.47 | 85.26 |
| S2 | 78.15 | 71.71 | 78.55 | 76.18 | 69.73 | 78.68 | 81.97 | 80.26 | 84.74 |
| S3 | 78.55 | 73.02 | 79.86 | 75.92 | 71.31 | 78.55 | 81.18 | 79.21 | 84.60 |
| S4 | 78.29 | 70.00 | 78.55 | 75.79 | 69.34 | 76.97 | 79.99 | 79.07 | 84.74 |
| S5 | 78.16 | 71.97 | 77.50 | 75.65 | 71.97 | 80.26 | 80.65 | 78.94 | 84.87 |
| S6 | 78.68 | 68.55 | 80.26 | 75.92 | 67.89 | 76.71 | 81.84 | 79.60 | 85.26 |
| S7 | 78.68 | 69.73 | 77.76 | 75.52 | 69.34 | 77.23 | 79.21 | 79.34 | 84.47 |
| S8 | 78.81 | 71.44 | 80.39 | 76.57 | 71.84 | 79.21 | 82.36 | 79.47 | 85.26 |
| S9 | 78.95 | 71.97 | 79.21 | 76.71 | 71.05 | 80.26 | 81.57 | 79.60 | 85.26 |
| S10 | 78.68 | 72.10 | 78.29 | 75.92 | 71.97 | 80.92 | 80.39 | 79.60 | 85.26 |
| Ave. | 78.57 | 71.23 | 78.81 | 76.07 | 70.78 | 78.90 | 81.10 | 79.46 | 84.97 |

  • Ave. denotes the average value over the above 10 domain sequences (i.e., S1, …, S10).

Table 2: Previous task Evaluation: Average SC performance over 19 previous domains for each sequence.

Baselines: We compare our LNB with NB, SVM Chang and Lin (2011), LSC Chen et al. (2015), and Lifelong Voting (LLV) Xia et al. (2017). For LLV, we use its first voting method that can improve a past model using future knowledge. As traditional NB and SVM only work on a single domain data, we use their variations from Chen et al. (2015). These variations are called NB-T, NB-S, NB-ST, SVM-T, SVM-S and SVM-ST respectively, see our supplementary material.

Settings: For NB, the smoothing parameter is set to 0.1 for Lidstone smoothing. For SVM, we use the default parameter settings. For LSC and LLV, we use their original parameter settings. For LNB, we set its parameters empirically. We use 5-fold cross validation in evaluation. For the dataset in the natural class distribution, we recorded the F1-score of the negative class. For the dataset in the balanced class distribution, we recorded the Accuracy of the positive and negative classes.
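The two evaluation measures and the 5-fold protocol mentioned in the settings can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the interleaved, unshuffled fold assignment is an assumption, since the text does not say how the folds were formed.

```python
def negative_f1(y_true, y_pred, negative="-"):
    """F1-score of the negative class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == negative and p == negative for t, p in pairs)
    fp = sum(t != negative and p == negative for t, p in pairs)
    fn = sum(t == negative and p != negative for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct negative prediction: precision/recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for k interleaved folds of range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]
```

Reporting the F1-score of the negative class rather than accuracy matters for the natural class distribution, where a classifier that labels almost everything positive can still reach high accuracy.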

New Task Evaluation: Following Chen et al. (2015), each domain in the 20-domain data is treated as the new (target) domain, with the remaining 19 domains as the past domains. Table 1 shows the average results over the 20 domains. From the results, we make the following observations:

Our LNB achieves the best F1-score (64.96) and the best Accuracy (83.17) on the two datasets, respectively, which shows the superiority of LNB.

NB-S (and SVM-S) is inferior to NB-T (and SVM-T), both of which are inferior to NB-ST (and SVM-ST). This shows that simply combining the training data from all past domains and the new domain is slightly beneficial, but worse than our model.

LLV performs poorly as its voting does not fit our setting. In Xia et al. (2017), all tasks are from the same domain, but our tasks are from different domains.

Our model is slightly better than LSC on the dataset in the balanced class distribution, but markedly better than LSC on the dataset in the natural class distribution. (In Table 1, the F1-score for LSC (56.62) is not the same as that reported in Chen et al. (2015): because we need to test on past domains, we used 80% of the reviews of each past domain for training, whereas Chen et al. (2015) used all reviews for training.) Note that these are not our main results, as our main goal is to go back to help previous/past domain models without retraining, which LSC cannot do because it needs the past domain data to optimize its objective function.

Previous Task Evaluation: We now evaluate how each previous domain performs after some new/future domains have been learned. Since LSC cannot use the future knowledge to improve past domain models without retraining, for each past domain, we use the classifier built when the past domain was the new domain at that time. For NB-S, NB-ST, SVM-S and SVM-ST, we also use the classifier built at that time. For NB-T and SVM-T, we use the classifier built on each previous domain. For LLV, we use it as it was done because LLV is a voting method, which can use future models to vote in any past domain classification. We give the results after all 20 domains have been learned. Since in this case the ordering of domains may affect the experiment results, we randomly created 10 domain sequences. Due to the space limit, we provide the created 10 domain sequences in the supplementary material. For each sequence, the test results of the previous 19 domains are averaged and the average value is shown in Table 2.

From Table 2, we clearly see that LNB again outperforms all baselines. Although in helping future domain learning on the dataset in the balanced class distribution, LNB is only slightly better than LSC (see Table 1), the ability of LNB to improve past domain models using future knowledge clearly shows its superiority to LSC.

Task Effect: Here we further evaluate the performance of our LNB in helping past domain models with different numbers of new/future domains (denoted as #future domains). In this experiment, we treat the first domain in each domain sequence as the target domain and vary the number of future domains. The curve of the average test results over the 10 domain sequences is shown in Figure 2. The curve clearly shows that LNB performs better with more future domains. This indicates that LNB indeed has the ability to go back and improve the past domain models using future knowledge.

Figure 2: (Left): Effects of #future domains on LNB in natural class distribution. (Right): Effects of #future domains on LNB in balanced class distribution.

5 Conclusions

This paper studied a new lifelong learning (LL) setting where the system uses the knowledge learned in future tasks to improve past task models with no retraining using the training data of the past tasks. We proposed a technique in this new LL setting by exploiting the generative model parameters of naïve Bayes. Experimental results showed the effectiveness of the approach. We believe this new setting is a promising direction for LL because we humans often learn new knowledge to solve past problems and future problems.

Appendix A Appendices

Here we introduce the datasets, the baselines and the 10 domain sequences used in our experiments.

Datasets: The dataset contains a collection of product reviews from 20 types of products (domains). Each domain contains 1000 reviews. Each review has been assigned a sentiment label, i.e., positive (+) or negative (-), based on the rating score. The names of these 20 domains with a serial number for each domain and the proportion of negative reviews are shown in Table 3.

| No. | Domain | Proportion of negative reviews | No. | Domain | Proportion of negative reviews |
|---|---|---|---|---|---|
| 1 | Alarm Clock | 30.51 | 11 | Home Theater System | 28.84 |
| 2 | Baby | 16.45 | 12 | Jewelry | 12.21 |
| 3 | Bag | 11.97 | 13 | Keyboard | 22.66 |
| 4 | Cable Modem | 12.53 | 14 | Magazine Subscriptions | 26.88 |
| 5 | Dumbbell | 16.04 | 15 | Movies TV | 10.86 |
| 6 | Flashlight | 11.69 | 16 | Projector | 20.24 |
| 7 | Jewelry | 19.50 | 17 | Rice Cooker | 18.64 |
| 8 | Gloves | 13.76 | 18 | Sandal | 12.11 |
| 9 | Graphics Card | 14.58 | 19 | Vacuum | 22.07 |
| 10 | Headphone | 20.99 | 20 | Video Games | 20.93 |
Table 3: Names of 20 domains with a serial number for each domain and the proportion of negative reviews in each domain.

In our experiments, we use the following two sets of datasets with different class distribution:

Natural class distribution: We keep the natural distribution of the positive and negative reviews, as shown in Table 3, and apply our system to this real-world situation. We use the F1-score of the negative class in evaluation because each domain has an imbalanced class distribution and the number of reviews in the negative class is small.

Balanced class distribution: We also create a balanced dataset from the dataset in the natural class distribution. Each domain in the created dataset has 200 reviews (100 positive and 100 negative). This dataset is small because the number of negative reviews in each domain is small. We use Accuracy of both classes in evaluation as each domain has balanced class distribution.
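One way such a balanced subset could be drawn is sketched below. The text does not say how the 100 positive and 100 negative reviews were chosen, so the uniform random sampling, the fixed seed, and the function name `make_balanced` are all assumptions of this example.

```python
import random


def make_balanced(reviews, labels, per_class=100, seed=0):
    """Sample `per_class` positive and `per_class` negative reviews.

    reviews: list of review texts; labels: list of '+' / '-' labels.
    Returns a shuffled list of (review, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    pos = [r for r, l in zip(reviews, labels) if l == "+"]
    neg = [r for r, l in zip(reviews, labels) if l == "-"]
    if len(pos) < per_class or len(neg) < per_class:
        raise ValueError("not enough reviews in one of the classes")
    sample = [(r, "+") for r in rng.sample(pos, per_class)]
    sample += [(r, "-") for r in rng.sample(neg, per_class)]
    rng.shuffle(sample)
    return sample
```

The per-class size is bounded by the smaller class, which is why the balanced dataset ends up much smaller than the natural one.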

Baselines: To have a comprehensive comparison, three variations of NB and SVM are created respectively:

NB and SVM trained and tested on the target domain, denoted by NB-T and SVM-T.

NB and SVM trained on the combined data from all non-target domains and tested on the target domain, denoted by NB-S and SVM-S.

NB and SVM trained on the combined data from all domains (including the target domain) and tested on the target domain, denoted by NB-ST and SVM-ST.

We note that NB-T and SVM-T are traditional (non-lifelong learning) methods because they only use the data from the target task. NB-S, NB-ST, SVM-S and SVM-ST can be regarded as simple lifelong methods because they use the data from previous tasks.
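The three baseline variations differ only in which training data they see, which can be sketched as follows. The function name, the `data` layout, and the short keys `"T"`, `"S"`, and `"ST"` are hypothetical choices for this illustration; any NB or SVM trainer could then be fitted on each returned set and tested on the target domain.

```python
def baseline_training_sets(data, target):
    """Build the training sets for the -T, -S and -ST baseline variations.

    data: dict mapping domain name -> list of (tokens, label) examples.
    target: name of the target domain.
    """
    target_only = list(data[target])
    source_only = [ex for domain, examples in data.items()
                   if domain != target for ex in examples]
    return {
        "T": target_only,                 # target domain only
        "S": source_only,                 # all non-target domains combined
        "ST": source_only + target_only,  # all domains combined
    }
```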

Domain Sequences: Since the ordering of domains may affect the experiment results, we randomly created 10 domain sequences. The created 10 domain sequences are shown in Figure 3.

S1: 17, 7, 16, 11, 10, 13, 12, 6, 5, 8, 18, 4, 14, 2, 9, 1, 15, 19, 3, 20
S2: 14, 13, 19, 10, 2, 12, 20, 1, 16, 5, 11, 4, 8, 9, 2, 17, 7, 6, 15, 18
S3: 5, 7, 17, 9, 4, 16, 15, 12, 10, 14, 11, 8, 19, 18, 6, 13, 2, 3, 20, 1
S4: 14, 4, 20, 7, 17, 15, 16, 8, 13, 9, 12, 6, 2, 3, 5, 10, 18, 11, 1, 19
S5: 5, 16, 2, 20, 18, 8, 13, 4, 6, 9, 10, 19, 7, 3, 11, 1, 14, 15, 12, 17
S6: 15, 11, 4, 20, 17, 3, 7, 10, 16, 12, 18, 1, 2, 13, 5, 8, 19, 6, 9, 14
S7: 20, 13, 2, 15, 9, 17, 14, 5, 16, 18, 7, 4, 11, 6, 1, 8, 10, 3, 19, 12
S8: 18, 11, 6, 12, 13, 1, 5, 7, 3, 16, 2, 4, 14, 8, 15, 19, 17, 10, 9, 20
S9: 16, 2, 10, 9, 12, 19, 6, 11, 18, 20, 4, 1, 3, 17, 5, 13, 15, 8, 14, 7
S10: 12, 16, 6, 8, 19, 2, 7, 1, 13, 17, 3, 9, 4, 11, 18, 15, 20, 5, 10, 14
Figure 3: The randomly created 10 domain sequences.


References

  • Agrawal et al. (2000) Rakesh Agrawal, Roberto Bayardo, and Ramakrishnan Srikant. 2000. Athena: Mining-based interactive management of text databases. In EDBT, pages 365–379.
  • Andreevskaia and Bergler (2008) Alina Andreevskaia and Sabine Bergler. 2008. When specialists and generalists work together: Overcoming domain dependence in sentiment tagging. In ACL, pages 290–298.
  • Augenstein et al. (2018) Isabelle Augenstein, Sebastian Ruder, and Anders Søgaard. 2018. Multi-task learning of pairwise sequence classification tasks over disparate label spaces. In NAACL, pages 1896–1906.
  • Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL, pages 440–447.
  • Bollegala et al. (2011) Danushka Bollegala, David J Weir, and John Carroll. 2011. Using multiple sources to construct a sentiment sensitive thesaurus for cross-domain sentiment classification. In ACL HLT, pages 132–141.
  • Cambria et al. (2017) Erik Cambria, Soujanya Poria, Alexander Gelbukh, and Mike Thelwall. 2017. Sentiment analysis is a big suitcase. IEEE Intelligent Systems, 32(6):74–80.
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27.
  • Chen and Liu (2018) Zhiyuan Chen and Bing Liu. 2018. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207.
  • Chen et al. (2015) Zhiyuan Chen, Nianzu Ma, and Bing Liu. 2015. Lifelong learning for sentiment classification. In ACL, pages 750–756.
  • Fei et al. (2016) Geli Fei, Shuai Wang, and Bing Liu. 2016. Learning cumulatively to become more knowledgeable. In KDD, pages 1565–1574.
  • He et al. (2011) Yulan He, Chenghua Lin, and Harith Alani. 2011. Automatically extracting polarity-bearing topics for cross-domain sentiment classification. In ACL, pages 123–131.
  • Isele et al. (2016) David Isele, Mohammad Rostami, and Eric Eaton. 2016. Using task features for zero-shot knowledge transfer in lifelong learning. In IJCAI, pages 1620–1626.
  • Li et al. (2012) Fangtao Li, Sinno Jialin Pan, Ou Jin, Qiang Yang, and Xiaoyan Zhu. 2012. Cross-domain co-extraction of sentiment and topic lexicons. In ACL, pages 410–419.
  • Li et al. (2013) Shoushan Li, Yunxia Xue, Zhongqing Wang, and Guodong Zhou. 2013. Active learning for cross-domain sentiment classification. In AAAI, pages 2127–2133.
  • Mitchell et al. (2015) Tom M Mitchell, William W Cohen, Estevam R Hruschka Jr, Partha Pratim Talukdar, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matthew Gardner, Bryan Kisiel, Jayant Krishnamurthy, et al. 2015. Never ending learning. In AAAI, pages 2302–2310.
  • Nigam et al. (1998) Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell, et al. 1998. Learning to classify text from labeled and unlabeled documents. AAAI/IAAI, pages 792–799.
  • Pan et al. (2010) Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In WWW, pages 751–760.
  • Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
  • Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In EMNLP, pages 79–86.
  • Pang et al. (2008) Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval, 2(1–2):1–135.
  • Ruvolo and Eaton (2013) Paul Ruvolo and Eric Eaton. 2013. ELLA: An efficient lifelong learning algorithm. In ICML, pages 507–515.
  • Shu et al. (2017) Lei Shu, Hu Xu, and Bing Liu. 2017. Lifelong learning CRF for supervised aspect extraction. In ACL, pages 148–157.
  • Silver et al. (2013) Daniel L Silver, Qiang Yang, and Lianghao Li. 2013. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, pages 48–55.
  • Sun et al. (2018) Gan Sun, Yang Cong, and Xiaowei Xu. 2018. Active lifelong learning with “watchdog”. In AAAI, pages 4107–4114.
  • Thrun (1998) Sebastian Thrun. 1998. Lifelong learning algorithms. In Learning to Learn, pages 181–209. Springer.
  • Wu et al. (2017) Fangzhao Wu, Yongfeng Huang, and Jun Yan. 2017. Active sentiment domain adaptation. In ACL, pages 1701–1711.
  • Wu et al. (2009) Qiong Wu, Songbo Tan, and Xueqi Cheng. 2009. Graph ranking for sentiment transfer. In ACL-IJCNLP, pages 317–320.
  • Xia et al. (2017) Rui Xia, Jie Jiang, and Huihui He. 2017. Distantly supervised lifelong learning for large-scale social media sentiment analysis. IEEE Transactions on Affective Computing, 8(4):480–491.
  • Xia and Zong (2011) Rui Xia and Chengqing Zong. 2011. A POS-based ensemble model for cross-domain sentiment classification. In IJCNLP, pages 614–622.
  • Xu et al. (2018) Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. 2018. Lifelong domain word embedding via meta-learning. In IJCAI, pages 4510–4516.