Continual Learning for Sentence Representations Using Conceptors

04/18/2019 ∙ by Tianlin Liu, et al. ∙ Jacobs University Bremen, University of Pennsylvania

Distributed representations of sentences have become ubiquitous in natural language processing tasks. In this paper, we consider a continual learning scenario for sentence representations: Given a sequence of corpora, we aim to optimize the sentence encoder with respect to the new corpus while maintaining its accuracy on the old corpora. To address this problem, we propose to initialize sentence encoders with the help of corpus-independent features, and then sequentially update sentence encoders using Boolean operations of conceptor matrices to learn corpus-dependent features. We evaluate our approach on semantic textual similarity tasks and show that our proposed sentence encoder can continually learn features from new corpora while retaining its competence on previously encountered corpora.

1 Introduction

Distributed representations of sentences are essential for a wide variety of natural language processing (NLP) tasks. Although recently proposed sentence encoders have achieved remarkable results (e.g., Yin and Schütze, 2015; Arora et al., 2017; Cer et al., 2018; Pagliardini et al., 2018), most, if not all, of them are trained on a priori fixed corpora. However, in open-domain NLP systems such as conversational agents, oftentimes we are facing a dynamic environment, where training data are accumulated sequentially over time and the distributions of training data vary with respect to external input (Lee, 2017; Mathur and Singh, 2018). To effectively use sentence encoders in such systems, we propose to consider the following continual sentence representation learning task: Given a sequence of corpora, we aim to train sentence encoders such that they can continually learn features from new corpora while retaining strong performance on previously encountered corpora.

Toward addressing the continual sentence representation learning task, we propose a simple sentence encoder that is based on the summation and linear transform of a sequence of word vectors aided by matrix conceptors. Conceptors have their origin in reservoir computing (Jaeger, 2014) and recently have been used to perform continual learning in deep neural networks (He and Jaeger, 2018). Here we employ Boolean operations of conceptor matrices to update sentence encoders over time to meet the following desiderata:

  1. Zero-shot learning. The initialized sentence encoder (no training corpus used) can effectively produce sentence embeddings.

  2. Resistance to catastrophic forgetting. When the sentence encoder is adapted on a new training corpus, it retains strong performance on old ones.

The rest of the paper is organized as follows. We first briefly review a family of linear sentence encoders. Then we explain how to build upon such sentence encoders for continual sentence representation learning tasks, which leads to our proposed algorithm. Finally, we demonstrate the effectiveness of the proposed method using semantic textual similarity tasks. [1]

[1] Our code is available on GitHub: https://github.com/liutianlin0121/contSentEmbed

Notation

We assume each word $w$ from a vocabulary set $V$ has a real-valued word vector $v_w \in \mathbb{R}^n$. Let $p(w)$ be the monogram probability of a word $w$. A corpus $D$ is a collection of sentences, where each sentence $s \in D$ is a multiset of words (word order is ignored here). For a collection of vectors $Y = \{y_i\}_{i \in I}$, where $y_i \in \mathbb{R}^l$ for $i$ in an index set $I$ with cardinality $|I|$, we let $[y_i]_{i \in I} \in \mathbb{R}^{l \times |I|}$ be a matrix whose columns are vectors $y_1, \dots, y_{|I|}$. An identity matrix is denoted by $I$.

2 Linear sentence encoders

We briefly overview “linear sentence encoders” that are based on linear algebraic operations over a sequence of word vectors. Among different linear sentence encoders, the smoothed inverse frequency (SIF) approach (Arora et al., 2017) is a prominent example – it outperforms many neural-network based sentence encoders on a battery of NLP tasks (Arora et al., 2017).

Derived from a generative model for sentences, the SIF encoder (presented in Algorithm 1) transforms a sequence of word vectors into a sentence vector in three steps. First, for each sentence in the training corpus, SIF computes a weighted average of word vectors (lines 1-3 of Algorithm 1); next, it estimates a “common discourse direction” of the training corpus (line 4 of Algorithm 1); finally, for each sentence in the testing corpus, it calculates the weighted average of the word vectors and projects the averaged result away from the learned common discourse direction (lines 5-8 of Algorithm 1). Note that this 3-step paradigm is slightly more general than the original one presented in (Arora et al., 2017), where the training and the testing corpora are assumed to be the same.

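Input : A training corpus $D$; a testing corpus $G$; parameter $a$; monogram probabilities $\{p(w)\}_{w \in V}$ of words
1 for sentence $s \in D$ do
2       $q_s \leftarrow \frac{1}{|s|} \sum_{w \in s} \frac{a}{p(w) + a} v_w$
3 end for
4 Let $u$ be the first singular vector of $[q_s]_{s \in D}$.
5 for sentence $s \in G$ do
6       $q_s \leftarrow \frac{1}{|s|} \sum_{w \in s} \frac{a}{p(w) + a} v_w$
7       $f_s \leftarrow q_s - u u^\top q_s$.
8 end for
Output : $\{f_s\}_{s \in G}$
Algorithm 1 SIF sentence encoder.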
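
The three steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the reference SIF implementation: the function names, the toy vocabulary, the toy probabilities, and the default `a=1e-3` are assumptions made for the example.

```python
import numpy as np

def weighted_average(sentence, vectors, prob, a):
    """SIF-weighted mean of the word vectors of one sentence (lines 2 and 6)."""
    weights = np.array([a / (prob[w] + a) for w in sentence])
    stacked = np.stack([vectors[w] for w in sentence])  # shape (|s|, n)
    return weights @ stacked / len(sentence)

def sif_encode(train, test, vectors, prob, a=1e-3):
    """Sketch of Algorithm 1; `a=1e-3` is an assumed default for illustration."""
    # Lines 1-3: one weighted average per training sentence, stacked as columns.
    Q = np.stack([weighted_average(s, vectors, prob, a) for s in train], axis=1)
    # Line 4: first left singular vector, i.e. the common discourse direction.
    u = np.linalg.svd(Q, full_matrices=False)[0][:, 0]
    # Lines 5-8: embed each test sentence and project it away from u.
    embedded = []
    for s in test:
        q = weighted_average(s, vectors, prob, a)
        embedded.append(q - u * (u @ q))
    return np.stack(embedded), u

# Toy data (hypothetical): random 4-dimensional "word vectors".
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran"]
vectors = {w: rng.normal(size=4) for w in vocab}
prob = {"the": 0.5, "cat": 0.1, "sat": 0.1, "dog": 0.2, "ran": 0.1}
train = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "ran"]]
test = [["the", "dog", "sat"], ["cat", "sat"]]

embeddings, u = sif_encode(train, test, vectors, prob)
```

Each row of `embeddings` is one test sentence vector; by construction it has no component along the estimated direction `u`.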

Building upon SIF, recent studies have proposed further improved sentence encoders (Khodak et al., 2018; Pagliardini et al., 2018; Yang et al., 2018). These algorithms roughly share the core procedures of SIF, albeit using more refined methods (e.g., softly removing more than one common discourse direction).

3 Continual learning for linear sentence encoders

In this section, we consider how to design a linear sentence encoder for continual sentence representation learning. We observe that common discourse directions used by SIF-like encoders are estimated from the training corpus. However, incrementally estimating common discourse directions in continual sentence representation learning tasks might not be optimal. For example, consider that we are sequentially given training corpora of tweets and news articles. When the first corpus (tweets) is presented, we can train a SIF sentence encoder using tweets. When the second corpus (news articles) is given, however, we face the problem of how to exploit the newly given corpus to improve the trained sentence encoder. A straightforward solution is to first combine the tweets and news article corpora and then train a new encoder from scratch using the combined corpus. However, this paradigm is neither efficient nor effective. It is not efficient in the sense that we will need to re-train the encoder from scratch every time a new corpus is added. Furthermore, it is not effective in the sense that the common direction estimated from scratch reflects a compromise between tweets and news articles, which might not be optimal for either stand-alone corpus. Indeed, it is possible that larger corpora will swamp smaller ones.

To make the common discourse learned from one corpus more generalizable to another, we propose to use the conceptor matrix (Jaeger, 2017) to characterize and update the common discourse features in a sequence of training corpora.

3.1 Matrix conceptors

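In this section, we briefly introduce matrix conceptors, drawing heavily on (Jaeger, 2017; He and Jaeger, 2018; Liu et al., 2019). Consider a set of vectors $\{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^N$ for all $i \in \{1, \dots, n\}$. A conceptor matrix $C$ is a regularized identity map that minimizes

$$\frac{1}{n} \sum_{i=1}^{n} \| x_i - C x_i \|_2^2 + \alpha^{-2} \| C \|_F^2 \qquad (1)$$

where $\| \cdot \|_F$ is the Frobenius norm and $\alpha$ is a scalar parameter called aperture. It can be shown that $C$ has a closed form solution:

$$C = \frac{1}{n} X X^\top \left( \frac{1}{n} X X^\top + \alpha^{-2} I \right)^{-1} \qquad (2)$$

where $X = [x_i]_{i \in \{1, \dots, n\}}$ is a data collection matrix whose columns are vectors from $\{x_1, \dots, x_n\}$. In intuitive terms, $C$ is a soft projection matrix on the linear subspace where the typical components of the $x_i$ samples lie. For convenience in notation, we may write $C(X, \alpha)$ to stress the dependence on $X$ and $\alpha$.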
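
As a small numerical illustration, the closed form can be computed directly and compared against the regularized objective it is meant to minimize. The sketch below assumes the objective is $\frac{1}{n}\sum_i \|x_i - C x_i\|^2 + \alpha^{-2}\|C\|_F^2$ with samples stored as columns of `X`; the function names and the toy data are hypothetical.

```python
import numpy as np

def conceptor(X, alpha):
    """Closed-form conceptor C(X, alpha); X holds one sample per column."""
    N, n = X.shape
    R = X @ X.T / n  # sample correlation matrix
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(N))

def objective(C, X, alpha):
    """Regularized reconstruction error that the conceptor should minimize."""
    n = X.shape[1]
    return np.sum((X - C @ X) ** 2) / n + alpha ** -2 * np.sum(C ** 2)

rng = np.random.default_rng(0)
# Toy samples with most of their variance along the first coordinate axis.
X = rng.normal(size=(3, 200)) * np.array([[5.0], [1.0], [0.1]])
C = conceptor(X, alpha=1.0)

# Eigenvalues lie in [0, 1): near 1 along dominant directions, near 0 elsewhere.
eigvals = np.linalg.eigvalsh((C + C.T) / 2)

# Any perturbation of C should increase the (strictly convex) objective.
perturbed = C + 0.01 * rng.normal(size=C.shape)
gap = objective(perturbed, X, 1.0) - objective(C, X, 1.0)
```

The eigenvalue spectrum makes the "soft projection" reading concrete: directions with large variance are kept almost fully, while low-variance directions are mostly suppressed.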

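Conceptors are subject to most laws of Boolean logic such as NOT $\neg$, AND $\wedge$ and OR $\vee$. For two conceptors $C$ and $B$, we define the following operations:

$$\neg C := I - C \qquad (3)$$
$$C \wedge B := (C^{-1} + B^{-1} - I)^{-1} \qquad (4)$$
$$C \vee B := \neg(\neg C \wedge \neg B) \qquad (5)$$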

Among these Boolean operations, the OR operation is particularly relevant for our continual sentence representation learning task. It can be shown that $C \vee B$ is the conceptor computed from the union of the two sets of sample points from which $C$ and $B$ are computed. Note, however, that to calculate $C \vee B$ we only need to know the two matrices $C$ and $B$; we do not need access to the two sets of sample points from which $C$ and $B$ are computed.
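
The Boolean operations can be sketched in a few lines of NumPy, assuming the definitions $\neg C = I - C$, $C \wedge B = (C^{-1} + B^{-1} - I)^{-1}$ and $C \vee B = \neg(\neg C \wedge \neg B)$. Under these definitions, a short calculation shows that $C \vee B$ equals the conceptor built from the sum of the two correlation matrices, which is one precise reading of the "union" statement above; the code checks this numerically. The function names are hypothetical, and `AND` as written assumes both arguments are invertible.

```python
import numpy as np

def conceptor_from_corr(R, alpha):
    """Conceptor from a correlation matrix R: R (R + alpha^-2 I)^-1."""
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(R.shape[0]))

def NOT(C):
    return np.eye(C.shape[0]) - C

def AND(C, B):
    # Assumes C and B are invertible (true for NOT of any conceptor).
    I = np.eye(C.shape[0])
    return np.linalg.inv(np.linalg.inv(C) + np.linalg.inv(B) - I)

def OR(C, B):
    # De Morgan: only NOT(C) and NOT(B) are inverted, which is always possible.
    return NOT(AND(NOT(C), NOT(B)))

rng = np.random.default_rng(1)
X1 = rng.normal(size=(4, 50))   # first sample set, one sample per column
X2 = rng.normal(size=(4, 80))   # second sample set
R1 = X1 @ X1.T / X1.shape[1]
R2 = X2 @ X2.T / X2.shape[1]
alpha = 1.0

C = conceptor_from_corr(R1, alpha)
B = conceptor_from_corr(R2, alpha)

C_or_B = OR(C, B)                                # uses only the matrices C and B
C_merged = conceptor_from_corr(R1 + R2, alpha)   # uses the underlying statistics
```

The point of the comparison is that `C_or_B` is computed without touching `X1` or `X2`, which is what makes OR usable when old corpora are no longer available.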

3.2 Using conceptors to continually learn sentence representations

We now show how to sequentially characterize and update the common discourse of corpora using the Boolean operation of conceptors. Suppose that we are sequentially given $M$ training corpora $D^1, \dots, D^M$, presented one after another. Without using any training corpus, we first initialize a conceptor $C^0$ which characterizes the corpus-independent common discourse features. More concretely, we compute $C^0 := C([v_w]_{w \in Z}, \alpha)$, where $[v_w]_{w \in Z}$ is a matrix of column-wise stacked word vectors of words from a stop word list $Z$ and $\alpha$ is a hyper-parameter. After initialization, for each new training corpus $D^i$ ($i = 1, \dots, M$) coming in, we compute a new conceptor $C^{temp} := C([q_s]_{s \in D^i}, \alpha)$ to characterize the common discourse features of corpus $D^i$, where the $q_s$ are defined in the SIF Algorithm 1. We can then use Boolean operations of conceptors to compute $C^i := C^{temp} \vee C^{i-1}$, which characterizes common discourse features from the new corpus as well as the old corpora. After all $M$ corpora are presented, we follow the SIF paradigm and use $C^M$ to remove common discourse features from (potentially unseen) sentences. The above outlined conceptor-aided (CA) continual sentence representation learning method is presented in Algorithm 2.

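Input : A sequence of $M$ training corpora $D^1, \dots, D^M$; a testing corpus $G$; hyper-parameters $a$ and $\alpha$; word probabilities $\{p(w)\}_{w \in V}$; stop word list $Z$.
1 $C^0 \leftarrow C([v_w]_{w \in Z}, \alpha)$.
2 for corpus index $i = 1, \dots, M$ do
3       for sentence $s \in D^i$ do
4             $q_s \leftarrow \frac{1}{|s|} \sum_{w \in s} \frac{a}{p(w) + a} v_w$
5       end for
6       $C^{temp} \leftarrow C([q_s]_{s \in D^i}, \alpha)$
7       $C^i \leftarrow C^{temp} \vee C^{i-1}$
8 end for
9 for $s \in G$ do
10       $q_s \leftarrow \frac{1}{|s|} \sum_{w \in s} \frac{a}{p(w) + a} v_w$
11       $f_s \leftarrow q_s - C^M q_s$
12 end for
Output : $\{f_s\}_{s \in G}$
Algorithm 2 CA sentence encoder.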
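
A compact NumPy sketch of Algorithm 2 is given below; it is an illustration, not the authors' released code. The helper names, the toy vocabulary, and the defaults `a=1e-3` and `alpha=1.0` are assumptions made for the example. Passing an empty list of corpora skips the corpus-dependent loop, which corresponds to the zero-shot use of the stop-word conceptor alone.

```python
import numpy as np

def conceptor(X, alpha):
    """C(X, alpha) for a matrix X with one sample per column."""
    R = X @ X.T / X.shape[1]
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(X.shape[0]))

def conceptor_or(C, B):
    """C OR B = NOT(NOT C AND NOT B), written out with NOT C = I - C."""
    I = np.eye(C.shape[0])
    return I - np.linalg.inv(np.linalg.inv(I - C) + np.linalg.inv(I - B) - I)

def weighted_average(sentence, vectors, prob, a):
    weights = np.array([a / (prob[w] + a) for w in sentence])
    return weights @ np.stack([vectors[w] for w in sentence]) / len(sentence)

def ca_encode(corpora, test, vectors, prob, stop_words, a=1e-3, alpha=1.0):
    """Sketch of Algorithm 2; the default a and alpha are assumed values."""
    # Line 1: corpus-independent conceptor from stop-word vectors.
    C = conceptor(np.stack([vectors[w] for w in stop_words], axis=1), alpha)
    # Lines 2-8: fold each incoming corpus into C with the OR operation.
    for corpus in corpora:
        Q = np.stack([weighted_average(s, vectors, prob, a) for s in corpus], axis=1)
        C = conceptor_or(conceptor(Q, alpha), C)
    # Lines 9-12: embed test sentences and softly remove common discourse.
    Q_test = np.stack([weighted_average(s, vectors, prob, a) for s in test], axis=1)
    return (Q_test - C @ Q_test).T, C

# Toy data (hypothetical).
rng = np.random.default_rng(0)
vocab = ["the", "a", "cat", "dog", "sat", "ran"]
vectors = {w: rng.normal(size=4) for w in vocab}
prob = {"the": 0.05, "a": 0.04, "cat": 1e-3, "dog": 1e-3, "sat": 5e-4, "ran": 5e-4}
stop_words = ["the", "a"]
corpus_1 = [["the", "cat", "sat"], ["a", "dog", "ran"]]
corpus_2 = [["the", "dog", "sat"], ["a", "cat", "ran"], ["cat", "dog"]]
test = [["the", "cat", "ran"], ["dog", "sat"]]

F, C_M = ca_encode([corpus_1, corpus_2], test, vectors, prob, stop_words)
F_zero_shot, C_0 = ca_encode([], test, vectors, prob, stop_words)
```

Only the running conceptor `C` is carried from one corpus to the next, so earlier corpora do not need to be stored or revisited.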

A simple modification of Algorithm 2 yields a “zero-shot” sentence encoder that requires only pre-trained word embeddings and no training corpus: we can simply skip the corpus-dependent steps (lines 2-8) and use $C^0$ in place of $C^M$ in line 11 of Algorithm 2 to embed sentences. This method will be referred to as “zero-shot CA.”

Figure 1: PCC results of STS datasets. Each panel shows the PCC results of a testing corpus (specified as a subtitle) as a function of increasing numbers of training corpora used. The setup of this experiment mimics (Zenke et al., 2017, section 5.1).

                             News   Captions   WordNet   Forums   Tweets
av. train-from-scratch SIF   66.5   79.7       80.3      55.5     74.2
zero-shot CA                 65.6   79.8       82.5      61.5     75.2
av. CA                       69.7   83.8       83.2      62.5     76.2
Table 1: Time-course averaged PCC of train-from-scratch SIF and conceptor-aided (CA) methods, together with the result of zero-shot CA. Best results are in boldface and the second best results are underscored.

4 Experiment

We evaluated our approach for continual sentence representation learning using semantic textual similarity (STS) datasets (Agirre et al., 2012, 2013, 2014, 2015, 2016). The evaluation criterion for such datasets is the Pearson correlation coefficient (PCC) between the predicted sentence similarities and the ground-truth sentence similarities. We split these datasets into five corpora by their genre: news, captions, wordnet, forums, tweets (for details see appendix). Throughout this section, we use publicly available 300-dimensional GloVe vectors (trained on the 840 billion token Common Crawl) (Pennington et al., 2014). Additional experiments with Word2Vec (Mikolov et al., 2013), Fasttext (Bojanowski et al., 2017), and Paragram-SL-999 (Wieting et al., 2015) are in the appendix.
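
The evaluation criterion can be sketched as follows. The text above does not state how predicted similarities are obtained from sentence vectors; the sketch assumes cosine similarity, a common choice for STS, and scales the PCC by 100 to match the scale of Table 1. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_pcc(emb_a, emb_b, gold):
    """PCC (times 100) between cosine similarities of sentence pairs and gold scores.

    emb_a[i] and emb_b[i] are the embeddings of the two sentences in pair i.
    Cosine similarity as the predictor is an assumption of this sketch.
    """
    predicted = np.array([cosine(a, b) for a, b in zip(emb_a, emb_b)])
    return 100.0 * np.corrcoef(predicted, gold)[0, 1]

# Toy pairs whose cosine similarities (1.0, 0.5, 0.0) are linear in the gold scores.
emb_a = np.array([[1.0, 0.0]] * 3)
emb_b = np.array([[1.0, 0.0], [0.5, np.sqrt(0.75)], [0.0, 1.0]])
gold = np.array([5.0, 2.5, 0.0])

score = sts_pcc(emb_a, emb_b, gold)
```

Because PCC is invariant to positive affine rescaling, the gold scores do not need to be normalized to the range of the cosine similarities.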

We use a standard continual learning experiment setup (cf. Zenke et al., 2017, section 5.1) as follows. We sequentially present the five training datasets in the order [2] of news, captions, wordnet, forums, and tweets, to train sentence encoders. Whenever a new training corpus is presented, we train a SIF encoder from scratch [3] (by combining all training corpora presented so far) and then test it on each corpus. At the same time, we incrementally adapt a CA encoder [4] using the newly presented corpus and test it on each corpus. The lines in each panel of Figure 1 show the test results of SIF and CA on each testing corpus (specified as the panel subtitle) as a function of the number of training corpora used (the first $n$ corpora of news, captions, wordnet, forums, and tweets for this experiment). To give a concrete example, consider the blue line in the first panel of Figure 1. This line shows the test PCC scores ($y$-axis) of the SIF encoder on the news corpus as the number of training corpora increases ($x$-axis). Specifically, the left-most blue dot indicates the test result of the SIF encoder on the news corpus when trained on the news corpus itself (that is, the first training corpus is used); the second point indicates the test result of the SIF encoder on the news corpus when trained on the news and captions corpora (that is, the first two training corpora are used); the third point indicates the test result of the SIF encoder on the news corpus when trained on the news, captions, and wordnet corpora (that is, the first three training corpora are used), and so on.

[2] The order can be arbitrary. Here we ordered the corpora from the one with the largest size (news) to the smallest size (tweets). The results from reversely ordered corpora are reported in the appendix.
[3] We set the parameter $a$ as in (Arora et al., 2017). The word frequencies are available at the GitHub repository of SIF.
[4] We used a fixed value of the hyper-parameter $\alpha$. Other parameters are set to be the same as SIF.

The dashed lines in the panels show the results of a corpus-specialized SIF, which is trained and tested on the same corpus, as done in (Arora et al., 2017, section 4.1). We see that the PCC results of CA are better and more “forgetting-resistant” than those of train-from-scratch SIF throughout the time course in which more training data are incorporated. Consider, for example, the test result on the news corpus (first panel) again. As more and more training corpora are used, the performance of train-from-scratch SIF drops with a noticeable slope; by contrast, the performance of CA drops only slightly.

As remarked in Section 3.2, with a simple modification of CA, we can perform zero-shot sentence representation learning without using any training corpus. The zero-shot learning results are presented in Table 1, together with the time-course averaged results of CA and train-from-scratch SIF (i.e., the averaged values of the CA or SIF scores in each panel of Figure 1). We see that our CA method performs best on average among these three methods on all five corpora. Somewhat surprisingly, the results yielded by zero-shot CA are better than the averaged results of train-from-scratch SIF on four of the five corpora (all except news).

We defer additional experiments to the appendix, where we compare CA against more baseline methods and use word vectors other than GloVe to carry out the experiments.

5 Conclusions and future work

In this paper, we formulated a continual sentence representation learning task: Given a consecutive sequence of corpora presented in a time-course manner, how can we extract useful sentence-level features from new corpora while retaining those from previously seen corpora? We identified that the existing linear sentence encoders usually fall short of solving this task as they rely on “common discourse” statistics estimated from a priori fixed corpora. We proposed two sentence encoders (CA encoder and zero-shot CA encoder) and demonstrated their effectiveness on the continual sentence representation learning task using STS datasets.

As the first paper considering the continual sentence representation learning task, this work is limited in a few ways, and it remains for future work to address these limitations. First, it is worthwhile to incorporate more benchmarks such as GLUE (Wang et al., 2019) and SentEval (Conneau and Kiela, 2018) into the continual sentence representation task. Second, this work only considers the case of linear sentence encoders, but future research can attempt to devise (potentially more powerful) non-linear sentence encoders to address the same task. Third, the proposed CA encoder operates at a corpus level, which might be a limitation if boundaries of training corpora are ill-defined. As a future direction, we expect to lift this assumption, for example, by updating the common direction statistics at a sentence level using Autoconceptors (Jaeger, 2014, section 3.14). Finally, the continual learning based sentence encoders should be applied to downstream applications in areas such as open-domain NLP systems.

Acknowledgement

The authors thank anonymous reviewers for their helpful feedback. This work was partially supported by João Sedoc’s Microsoft Research Dissertation Grant.

The split STS datasets

In the main body of the paper, we reported that we used the STS datasets split by genre. A detailed list of these STS tasks can be found in Table 2; the datasets can be downloaded from http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark and http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Companion.

News                  Captions              Forum                 Tweets               WN
MSRpar 2012           MSRvid 2012           deft-forum 2014       tweet-news 2014      OnWN 2012-2014
headlines 2013-2016   images 2014-2015      answers-forums 2015
deft-news 2014        track5.en-en 2017     answer-answer 2016
4299 sentence pairs   3250 sentence pairs   1079 sentence pairs   750 sentence pairs   2061 sentence pairs

Table 2: STS datasets breakdown according to genres.

CA compared with incremental-deletion SIF

We compare the CA approach with the following variant of SIF. In the learning phase, for each corpus coming in, we learn and store a common direction (estimated from the new corpus). In the testing phase, for a sentence in the testing corpora, we project it away from all common directions we have stored so far. We call this approach SIF with incremental deletions. The test results are reported in Figure 2.
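
One way to realize this baseline is sketched below. The description above leaves open how a vector is projected away from several directions that need not be mutually orthogonal; this sketch assumes the component in the span of all stored directions is removed, using a QR factorization to orthonormalize them. The class and method names are hypothetical, and the inputs stand for matrices of weighted-average sentence vectors stacked as columns.

```python
import numpy as np

class IncrementalDeletionSIF:
    """Stores one common direction per corpus; removes all of them at test time."""

    def __init__(self):
        self.directions = []

    def learn(self, Q):
        # Q: weighted-average sentence vectors of the new corpus, one per column.
        u = np.linalg.svd(Q, full_matrices=False)[0][:, 0]
        self.directions.append(u)

    def embed(self, q):
        if not self.directions:
            return q.copy()
        # Orthonormal basis of the stored directions (assumed linearly independent).
        U, _ = np.linalg.qr(np.stack(self.directions, axis=1))
        return q - U @ (U.T @ q)

# Toy usage with random stand-ins for two corpora of sentence vectors.
rng = np.random.default_rng(2)
model = IncrementalDeletionSIF()
model.learn(rng.normal(size=(5, 20)))
model.learn(rng.normal(size=(5, 30)))
q = rng.normal(size=5)
f = model.embed(q)
```

Unlike the conceptor-based update, each stored direction is removed completely, so the number of deleted dimensions grows by one with every new corpus.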

Figure 2: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora. For explanation see text.

CA without stop word initialization

We have also tested the performance of CA without initializing the conceptor with stop words. That is, we set $C^0$ as a zero matrix in our CA algorithm. The results are reported in Figure 3.

Figure 3: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora. For explanation see text.

We see that initializing CA with stop words is more beneficial than omitting this initialization, especially for testing corpora that are unseen in the training data.
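
To see what a zero initialization does, the following short derivation may help. It assumes the Boolean definitions $\neg C = I - C$, $C \wedge B = (C^{-1} + B^{-1} - I)^{-1}$ and $C \vee B = \neg(\neg C \wedge \neg B)$, and writes $C^{temp}$ for the conceptor of the first training corpus. Under these assumptions the zero matrix is neutral for OR, so the first update simply yields the conceptor of the first corpus and no corpus-independent information is carried along.

```latex
\begin{align*}
C^{temp} \vee 0
  &= \neg\bigl(\neg C^{temp} \wedge \neg 0\bigr)
   = \neg\bigl(\neg C^{temp} \wedge I\bigr) \\
  &= \neg\Bigl(\bigl((\neg C^{temp})^{-1} + I^{-1} - I\bigr)^{-1}\Bigr)
   = \neg\neg C^{temp}
   = C^{temp}.
\end{align*}
```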

CA with the reverse-ordered sequence of training corpora

In the main body of the paper, we sequentially presented new training corpora to the sentence encoders, from the largest corpus (news) to the smallest (tweets). We have remarked that this choice of ordering is essentially arbitrary. We now report the results for the reverse order (i.e., from the smallest corpus to the largest) in Figure 4. We see that the CA approach still outperforms train-from-scratch SIF throughout the time course.

Figure 4: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora. For explanation see text.

Experiment using other word embedding brands

We repeat the experiments with Word2Vec (Mikolov et al., 2013) [5] (pre-trained on Google News; 3 million tokens), Fasttext (Bojanowski et al., 2017) [6] (pre-trained on Common Crawl; 2 million tokens), and Paragram-SL-999 [7] (fine-tuned based on GloVe). The pipeline of the experiments echoes that of the main body of the paper.

[5] https://code.google.com/archive/p/word2vec/
[6] https://fasttext.cc/docs/en/english-vectors.html
[7] https://cogcomp.org/page/resource_view/106

Using Word2vec

Figure 5: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora. Word2Vec is used.

Using Fasttext

Figure 6: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora. Fasttext is used.

Using Paragram-SL-999

Figure 7: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora. Paragram-SL-999 is used.