Multi-view Sentence Representation Learning

05/18/2018 · Shuai Tang, et al. · University of California, San Diego

Multi-view learning can provide self-supervision when different views are available of the same data. The distributional hypothesis provides another form of useful self-supervision from adjacent sentences, which are plentiful in large unlabelled corpora. Motivated by the asymmetry of the two hemispheres of the human brain, as well as the observation that different learning architectures tend to emphasise different aspects of sentence meaning, we create a unified multi-view sentence representation learning framework in which one view encodes the input sentence with a Recurrent Neural Network (RNN) and the other view encodes it with a simple linear model; the training objective is to maximise the agreement, specified by the adjacent context information, between the two views. We show that, after training, the vectors produced by multi-view training provide improved representations over single-view training, and that the combination of different views gives further representational improvement and demonstrates solid transferability on standard downstream tasks.


1 Introduction

Multi-view learning methods provide the ability to extract information from different views of the data and enable self-supervised learning of useful features for future prediction when annotated data is not available deSa1993LearningCW . Minimising the disagreement among multiple views helps the model to learn rich feature representations of the data and, after training, an ensemble of the feature vectors from multiple views can provide an even stronger generalisation ability.

The distributional hypothesis harris1954distributional noted that words that occur in similar contexts tend to have similar meaning Turney2010FromFT , and the distributional similarity firth57synopsis consolidated this idea by stating that the meaning of a word or a sentence can be determined by the company it keeps. This principle has been widely used in the machine learning community to learn vector representations of human languages. Models built upon distributional similarity do not explicitly require human-annotated training data; the supervision comes from the semantic continuity of the language data, such as text and speech.

Large quantities of annotated data are usually hard to obtain. Our goal is to propose a learning framework built on the ideas of multi-view learning and the distributional hypothesis to learn from unlabelled data. We draw inspiration from the lateralisation and asymmetry in information processing of the two hemispheres of the human brain where, for most adults, sequential processing dominates the left hemisphere, and the right hemisphere has a focus on parallel processing bryden2012laterality but both hemispheres have been shown to have roles in literal and non-literal language comprehension Coulson2005RightHS ; Coulson2007ASR .

We aim to leverage the functionality of both RNN-based models, which have been widely applied in sentiment analysis tasks Yang2016HierarchicalAN , and linear/log-linear models, which have excelled at capturing attributional similarities of words and sentences Arora2016ALV ; Arora2017ASB ; Hill2016LearningDR ; Turney2010FromFT , in our multi-view sentence representation learning framework. Our contribution is threefold:

1) A new multi-view sentence representation learning framework is proposed, in which one view is an RNN encoder and the other is a simple linear average-on-word-vectors encoder; the agreement between views is measured by the cosine similarity between the pair of sentence representations generated from the two views.

2) We show that each view is improved by our efficiently trained multi-view learning framework compared with single-view training, and that an ensemble of the two views provides even better results.

3) The proposed model achieves good performance on the unsupervised tasks, overall outperforming existing unsupervised transfer learning models, and it also shows solid results on supervised tasks, which are either comparable to or better than those of the best unsupervised transfer model.

Our model utilises the distributional similarity as it learns to maximise the agreement among adjacent sentences. Instead of relying on two different input data sources, our model takes a single input and processes it in two independent and distinctive ways, which enables multi-view learning in our proposed framework. In addition, our model provides an intriguing hypothesis for a functional role of hemispheric specialisation, one which emphasises the importance of both hemispheres in language learning and comprehension.

2 Related Work

Learning from context information, guided by the distributional similarity, has had great success in learning vector representations of words, such as word2vec Mikolov2013DistributedRO , GloVe Pennington2014GloveGV , and FastText Bojanowski2017EnrichingWV . Joint training of word representations and document representations was also proposed in Le2014DistributedRO .

Our learning framework falls into the same category as described in Kiros2015SkipThoughtV , which is built on the distributional hypothesis harris1954distributional to learn sentence representations. Briefly, skip-thought Kiros2015SkipThoughtV , as well as FastSent Hill2016LearningDR , CNN-LSTM Gan2017LearningGS , etc., use an encoder-decoder model, and the training objective is to maximise the likelihood of generating the surrounding sentences given the current sentence as the input to the encoder. The idea is simple, yet its scalability for very large corpora is hindered by the slow decoding process that dominates training time.

A more intuitive approach is to learn a model that maximises the agreement among the representations of adjacent sentences, and minimises the agreement among those that are not adjacent. A coherence-based learning framework proposed in Li2014AMO trains a model to classify whether the sentences in a triplet are contiguous in a corpus or not. Additional discourse information is also helpful in learning sentence representations Jernite2017DiscourseBasedOF ; Nie2017DisSentSR , but can be costly to obtain for large corpora.

The closest related work includes Siamese Continuous Bag-of-words (Siamese CBOW) Kenter2016SiameseCO , and Quick-thought vectors (QT) logeswaran2018an . Both models maximise the agreement between produced vector representations of adjacent sentences; this objective can be trained on an unlabelled corpus efficiently, and produce sentence representations with rich semantics. Siamese CBOW tunes word vectors to increase the cosine similarity of adjacent sentences, while QT optimises two RNNs to encode the current sentence and the sentences in the context respectively. Although the two RNN encoders are parameterised independently, the way they process the information in the sentences is the same.

The training objective of our model is also to maximise the cosine similarity between adjacent sentences, but, instead of encoding the current sentence and the sentences in the context using the same architecture, our proposed model encodes sentences in two independent and different views. Having two distinctive information processing views encourages the model to encode different aspects of an input sentence, and is beneficial to the future use of the learnt representations.

It is shown in Hill2016LearningDR that the consistency between supervised and unsupervised evaluation tasks is much lower than the consistency within either group of tasks alone, and that a model which performs well on supervised evaluation tasks may fail on unsupervised tasks. Conneau2017SupervisedLO subsequently showed that, with a labelled training corpus, such as SNLI Bowman2015ALA and MultiNLI Williams2017ABC , the resulting sentence representations from the trained model excel in both supervised and unsupervised tasks. Our model is able to achieve good results on both groups of tasks without labelled information.

3 Model Architecture

Our goal is to marry the RNN-based sentence encoder and the avg-on-word-vectors sentence encoder into a unified learning framework with a simple training objective. It is intuitive to train a single model to maximise the agreement between the two views of the same sentence, and also between adjacent sentences, based on the distributional similarity firth57synopsis .

The motivation for the idea is that, as mentioned in the prior work, RNN-based encoders process the sentences sequentially, and are able to capture complex syntactic interactions, while the avg-on-word-vectors encoder has been shown to be good at capturing the coarse meaning of a sentence which could be useful for finding paradigmatic parallels Turney2010FromFT .

We present a unified learning framework to learn two sentence encoders in two views jointly; after training, the vectors produced from two encoders of the same sentence input are used to compose the sentence representation. The details of our learning framework are described as follows:

3.1 Encoders in Two Views

Consider a batch of N contiguous sentences {s_1, s_2, ..., s_N}. Each s_i in the batch has M_i words, which are transformed into a sequence of word vectors using pretrained word vectors and passed to two encoders, f and g, to produce two vector representations v_i^f and v_i^g. The details of the calculation of the representations from f and g are presented in Table 1.

Encoder f: The encoding function f is a bidirectional Gated Recurrent Unit (GRU) Chung2014EmpiricalEO with dimensionality d in each direction; it takes the sequence of word vectors, processes them one at a time, and generates a sequence of hidden states h_1, ..., h_{M_i}. The hidden state at the last time step is taken as the representation v_i^f.

Encoder g: The encoding function g is simply a single-layer feed-forward neural network, i.e., a trainable linear projection. As found in prior work Kenter2016SiameseCO ; Hill2016LearningDR ; Arora2017ASB , linear/log-linear models perform better on sentence similarity tasks measured by the cosine metric. The representation is calculated as v_i^g = W x̄_i, where x̄_i is the average of the word vectors in s_i and W is the weight matrix; thus g(s_i) = W x̄_i.
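To make the two views concrete, the following is a minimal PyTorch sketch of the pair of encoders; the dimensions and class names are illustrative assumptions, not the paper's exact configuration, and the batch is assumed unpadded:

```python
# A minimal sketch of the two views; WORD_DIM and DIM are hypothetical sizes.
import torch
import torch.nn as nn

WORD_DIM, DIM = 300, 1024   # illustrative word-vector / encoder dimensions

class BiGRUEncoder(nn.Module):
    """View f: a bidirectional GRU; the hidden state at the last
    time step is taken as the sentence representation v^f."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(WORD_DIM, DIM, bidirectional=True, batch_first=True)

    def forward(self, x):               # x: (batch, seq_len, WORD_DIM)
        h, _ = self.gru(x)              # h: (batch, seq_len, 2 * DIM)
        return h[:, -1, :]              # last time step -> v^f

class LinearAvgEncoder(nn.Module):
    """View g: a trainable linear projection of the averaged word vectors,
    v^g = W * mean(x_1, ..., x_M)."""
    def __init__(self):
        super().__init__()
        self.W = nn.Linear(WORD_DIM, 2 * DIM, bias=False)

    def forward(self, x):               # x: (batch, seq_len, WORD_DIM)
        return self.W(x.mean(dim=1))
```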

3.2 Removing the First Principal Component

The idea of removing the top principal components as a post-processing step was applied in both Arora2017ASB ; Mu2017AllbuttheTopSA , and is incorporated in both the training and testing phases of our learning framework. The Power Iteration mises1929praktische method is used to efficiently estimate the top principal component u during training, which is then removed by v ← v − (u⊤v)u. This step is applied to the representations produced from f and g individually; the details are presented in Section 1 of the supplementary material.

3.3 Training Objective

Learning from the distributional similarity is incorporated in our training objective, which is to maximise the agreement between the representations of a sentence pair across the two views if one sentence in the pair is in the context of the other. The agreement between the two views of a sentence pair (s_i, s_j) is defined as $a_{ij} = \cos(\bar{v}_i^{f}, \bar{v}_j^{g}) + \cos(\bar{v}_i^{g}, \bar{v}_j^{f})$. The training objective is to minimise the loss function:

$$\ell(\theta, W, \tau) = -\sum_{i=1}^{N}\;\sum_{j:\,0<|i-j|\le c} \log \frac{\exp(a_{ij}/\tau)}{\sum_{k=1}^{N}\exp(a_{ik}/\tau)} \qquad (1)$$

where θ contains the parameters of f, W is the parameter matrix of g, τ is the trainable temperature term, which is essential for exaggerating the difference between adjacent sentences and those that are not, and the context window c and the batch size N are hyperparameters.
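The sketch below is a minimal PyTorch version of this objective, assuming the N sentences in a batch are contiguous in the corpus and using the symmetric agreement later written out in Eq. 2 (Section 5.2); the function name and the log-parameterised temperature are our own illustrative choices:

```python
import torch
import torch.nn.functional as F

def multiview_loss(v_f, v_g, log_tau, c=3):
    """v_f, v_g: (N, d) post-processed and normalised representations of
    the same N contiguous sentences from the two views; log_tau: a
    trainable scalar tensor."""
    N = v_f.size(0)
    tau = log_tau.exp()                          # keep the temperature positive
    a = (v_f @ v_g.t() + v_g @ v_f.t()) / tau    # symmetric agreement (Eq. 2)
    log_p = F.log_softmax(a, dim=1)              # distribution over the batch
    idx = torch.arange(N)
    loss = v_f.new_zeros(())
    for off in range(-c, c + 1):                 # neighbours within the window
        if off == 0:
            continue
        i = idx[max(0, -off): N - max(0, off)]   # rows with a valid neighbour
        loss = loss - log_p[i, i + off].mean()   # pull adjacent pairs together
    return loss
```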

Phase Training Testing
Supervised Unsupervised
Bi-GRU:
Linear:
Ensemble Concatenation Addition
Table 1: The calculation of representations in the training and testing phases. "max()", "mean()", and "min()" refer to global max-, mean- and min-pooling over time, each of which results in a single vector. The table also presents the diversity of ways in which a single sentence representation can be calculated.

4 Experimental Design

Three unlabelled corpora from different genres are used in our experiments: the BookCorpus (C1) Zhu2015AligningBA , the UMBC News Corpus (C2) han2013umbc_ebiquity , and the Amazon Book Review corpus (C3) McAuley2015ImageBasedRO ; the models are trained separately on each of the three corpora. Summary statistics of the three corpora can be found in Table 2. The Adam optimiser Kingma2014AdamAM and gradient clipping Pascanu2013OnTD are applied for stable training.

Name # of sentences mean # of words per sentence
BookCorpus (C1)  74M 13
UMBC News (C2)  134.5M 25
Amazon Book Review (C3)  150.8M 19
Table 2: Summary statistics of the three corpora used in our experiments. For simplicity, the three corpora will be referred to as C1, C2 and C3 in the following tables respectively.

All of our experiments, including training and testing, are done in PyTorch (http://pytorch.org/). A modified SentEval (https://github.com/facebookresearch/SentEval/) package, augmented with the step that removes the first principal component, is used to evaluate our models on the downstream tasks. The hyperparameters (the batch size N, the encoder dimension d, and the context window c) are tuned only on the averaged performance on STS14 of the model trained on the BookCorpus; the STS14/C1 performance is thus marked in Table 3 and Table 4 to indicate possible overfitting on that dataset/model pair.

Task | Un. Transfer Arora2017ASB ; Mu2017AllbuttheTopSA ; Kenter2016SiameseCO | Su. Transfer Conneau2017SupervisedLO ; Wieting2017RevisitingRN
Columns: Multi-view (C1, C2, C3) | GloVe (avg, tfidf, WR, proc.) | word2vec (bow, proc.) | ST | S-cbow | InferSent | GRAN | LSTM-avg
STS12 60.9 64.0 60.7 52.5 58.7 56.2 54.1 57.2 57.7 30.8 47.5 58.2 62.5 64.8
STS13 60.1 61.7 59.9 42.3 52.1 56.6 57.7 56.8 58.0 24.8 42.9 48.5 63.4 63.1
STS14 71.5 73.7 70.7 54.2 63.8 68.5 59.2 62.9 63.3 31.4 60.4 67.1 75.9 75.8
STS15 76.4 77.2 76.5 52.7 60.6 71.7 57.3 62.7 63.4 31.0 30.7 71.1 75.8 76.7
STS16 75.8 76.7 74.8 - - - - - - - - 71.2 - -
SICK14 74.7 74.9 72.8 69.4 69.4 72.2 67.9 70.1 61.5 49.8 - 73.4 72.9 71.3
Table 3: Results on unsupervised evaluation tasks (Pearson's r). Bold numbers are the best results among unsupervised transfer models, and underlined numbers are the best among all models. Our models outperform other unsupervised transfer models, and provide comparable results to supervised transfer models. The model trained on the UMBC News Corpus gives the best results among our three models.

4.1 Unsupervised Evaluation - Textual Similarity Tasks

Representation: For a given input sentence s with M words, following the suggestion in Pennington2014GloveGV ; Levy2015ImprovingDS , the representation is calculated as z = v̄^f + v̄^g, where v̄ refers to the post-processed and normalised vector mentioned in Table 1.
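As one concrete reading of this composition (the exact unsupervised-testing cells of Table 1 are not fully specified here), the sketch below removes the first principal component from each view, normalises, and adds; all names and shapes are illustrative assumptions:

```python
import torch.nn.functional as F

def sts_representation(v_f, v_g, u_f, u_g):
    """v_f, v_g: (N, d) raw view representations; u_f, u_g: (d,) first
    principal components of the respective views, estimated in training."""
    # post-process (remove the first principal component), then normalise
    z_f = F.normalize(v_f - (v_f @ u_f).unsqueeze(1) * u_f, dim=1)
    z_g = F.normalize(v_g - (v_g @ u_g).unsqueeze(1) * u_g, dim=1)
    return z_f + z_g          # addition of the two views, per Table 1
```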

Tasks: The unsupervised tasks include five tasks from SemEval Semantic Textual Similarity (STS) in 2012-2016 Agirre2015SemEval2015T2 ; Agirre2014SemEval2014T1 ; Agirre2016SemEval2016T1 ; Agirre2012SemEval2012T6 ; Agirre2013SEM2S and the SemEval2014 Semantic Relatedness task (SICK-R) Marelli2014ASC .

Comparison: We compare our models with (I.) unsupervised transfer learning: Skip-thought (ST), avg-GloVe, tfidf-GloVe, GloVe+WR, GloVe+proc. Arora2017ASB , word2vec+bow, word2vec+proc. Mu2017AllbuttheTopSA , Siamese CBOW Kenter2016SiameseCO , FastSent Hill2016LearningDR , and QT logeswaran2018an ; and (II.) supervised transfer learning: the avg-LSTM and the GRAN Wieting2017RevisitingRN trained on the Paraphrase Database (PPDB) Ganitkevitch2013PPDBTP , and InferSent Conneau2017SupervisedLO trained on SNLI Bowman2015ALA and MultiNLI Williams2017ABC (we downloaded the released InferSent model and evaluated it with the modified SentEval package). The results are presented in Table 3. Since the performance of FastSent and QT was only evaluated on STS14, we compare with their results in Table 4.

All three models trained with our learning framework outperform the other unsupervised transfer learning methods, and the model trained on the UMBC News Corpus gives the best performance among our three models. The STS tasks contain multiple news- and headline-related datasets, and the UMBC News Corpus matches the domain of these datasets, which might explain the good results of the model trained on UMBC News. The detailed results on all datasets in the STS tasks are presented in the supplementary material.

FastSent Hill2016LearningDR QT logeswaran2018an Multi-view
+AE RNN BOW C1 C2 C3
61.2 59.5 49.0 65.0 71.5 73.7 70.7
Table 4: Comparison with FastSent and QT on STS14 (Pearson's r).

4.2 Supervised Evaluation

The evaluation on these tasks involves learning a linear model on top of the sentence representations produced by the model, thus it is named supervised evaluation. Since a linear model is capable of picking the most relevant dimensions of the feature vectors to make predictions, it is preferable to concatenate various types of representations into a richer, and possibly more redundant, feature vector, which allows the linear model to explore the combination of different aspects and provide better predictions.

Representation: Inspired by McCann2017LearnedIT , the representation v^f is calculated by concatenating the outputs of global mean-, max- and min-pooling on top of the hidden states h_1, ..., h_M, together with the last hidden state, and v^g is calculated with the three pooling functions as well. The post-processing and normalisation step is applied to each individually. These two representations are then concatenated to form the final sentence representation. Table 1 presents the calculation of the representations.
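A short sketch of this feature construction, under one plausible reading of Table 1 in which the linear view pools over the per-word projections W x_t; names and shapes are illustrative:

```python
import torch

def pool(h):                              # h: (batch, seq_len, d)
    return torch.cat([h.max(dim=1).values,
                      h.mean(dim=1),
                      h.min(dim=1).values], dim=1)

def supervised_representation(h_f, h_g):
    """h_f: GRU hidden states over time; h_g: per-word projected vectors."""
    v_f = torch.cat([pool(h_f), h_f[:, -1, :]], dim=1)  # + last hidden state
    v_g = pool(h_g)
    # post-processing and normalisation are applied to each part individually
    return torch.cat([v_f, v_g], dim=1)                 # final concatenation
```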

Tasks: Semantic relatedness (SICK) Marelli2014ASC , paraphrase detection (MRPC) Dolan2004UnsupervisedCO , question-type classification (TREC) Li2002LearningQC , movie review sentiment (MR) Pang2005SeeingSE , Stanford Sentiment Treebank (SST) Socher2013RecursiveDM , customer product reviews (CR) Hu2004MiningAS , subjectivity/objectivity classification (SUBJ) Pang2004ASE , opinion polarity (MPQA) Wiebe2005AnnotatingEO .

Comparison: Our results, as well as the results of supervised task-dependent models, supervised transfer learning models, and unsupervised transfer learning models, are presented in Table 5. Note that, for a fair comparison, we report the results of the best single model (MC-QT) trained on BookCorpus in logeswaran2018an .

The three models trained with our learning framework either outperform other existing methods or achieve similar results on some tasks. The model trained on the Amazon Book Review corpus gives the best performance on sentiment analysis tasks, since the corpus conveys strong sentiment information.

 

Model Hrs SICK-R SICK-E MRPC TREC MR CR SUBJ MPQA SST

 

Supervised task-dependent training - No transfer learning
AdaSent Zhao2015SelfAdaptiveHS - - - - 92.4 83.1 86.3 95.5 93.3 -
TF-KLD Conneau2017SupervisedLO - - - 80.4/85.9 - - - - - -
Supervised training - Transfer learning
InferSent Conneau2017SupervisedLO 24 88.40 86.3 76.2/83.1 88.2 81.1 86.3 92.4 90.2 84.6
Unsupervised training with unordered sentences
TF-IDF Hill2016LearningDR - - - 73.6/81.7 85.0 73.7 79.2 90.3 82.4 -
ParagraphVec Le2014DistributedRO 4 - - 72.9/81.1 59.4 60.2 66.9 76.3 70.7 -
word2vec+bow Conneau2017SupervisedLO 2 80.30 78.7 72.5/81.4 83.6 77.7 79.8 90.9 88.3 79.7
GloVe+bow Conneau2017SupervisedLO - 80.00 78.6 72.1/80.9 83.6 78.7 78.5 91.6 87.6 79.8
GloVe+WR Arora2017ASB - 86.03 84.6 - / - - - - - - 82.2
FastText+bow Mikolov2017AdvancesIP - - - 73.4/81.6 84.0 78.2 81.1 92.5 87.8 82.0
SDAE Hill2016LearningDR 72 - - 73.7/80.7 78.4 74.6 78.0 90.8 86.9 -
Unsupervised training with ordered sentences
FastSent Hill2016LearningDR 2 - - 72.2/80.3 76.8 70.8 78.4 88.7 80.6 -
FastSent+AE Hill2016LearningDR 2 - - 71.2/79.1 80.4 71.8 76.5 88.8 81.5 -
ST Kiros2015SkipThoughtV 336 85.80 82.3 73.0/82.0 92.2 76.5 80.1 93.6 87.1 82.0
ST+LN Ba2016LayerN 720 85.80 79.5 - / - 88.4 79.4 83.1 93.7 89.3 82.9
CNN-LSTM Gan2017LearningGS - 86.18 - 76.5/83.8 92.6 77.8 82.1 93.6 89.4 -
DiscSent Jernite2017DiscourseBasedOF 8 - - 75.0/ - 87.2 - - 93.0 - -
DisSent Nie2017DisSentSR - 79.10 80.3 - / - 84.6 82.5 80.2 92.4 89.6 82.9
MC-QT logeswaran2018an 11 86.80 - 76.9/84.0 92.8 80.4 85.2 93.9 89.4 -
Multi-view C1 3 87.85 84.8 77.1/83.4 91.8 81.6 83.9 94.5 89.1 85.8
Multi-view C2 8.5 87.82 85.2 76.8/83.9 91.6 81.5 82.9 94.7 89.3 84.9
Multi-view C3 8 87.74 85.2 75.7/82.5 89.8 85.0 85.7 95.7 90.0 89.6
Table 5: Results on the supervised evaluation tasks. Bold numbers are the best results among unsupervised transfer models, and underlined numbers are the best among all models. "†" refers to an ensemble of two models. "‡" indicates that additional labelled discourse information is required. Our models perform similarly to or better than existing methods, with higher training efficiency.

5 Discussion

5.1 Multi-view Learning vs. Single-view Learning

In order to determine whether multi-view learning with two different views/encoding functions helps, we compare our model with other reasonable variants, including multi-view models with two functions of the same type but parameterised independently (either two f-s or two g-s), and single-view models with only one f or one g. The comparison is conducted on both BookCorpus and UMBC News. Table 6 presents the results of the models trained on the UMBC corpus (the comparison on BookCorpus can be found in the supplementary material).

In our multi-view learning with f and g, the two encoding functions augment each other. As illustrated in previous work, and specifically emphasised in Hill2016LearningDR , linear/log-linear models, which include g in our model, produce better representations for STS tasks than RNN-based models do. The same finding can be observed in Table 6, where g consistently provides better results on STS tasks than f does. In addition, as we expected, in our multi-view learning with f and g, g augments the performance of f on STS tasks. Since training maximises the agreement between the representations generated from f and g, it is also expected that f improves g on the supervised evaluation tasks, as shown in the table.

UMBC News (C2) | Hrs | Avg of STS tasks (STS12-16) | Avg of SICK-R, STS-B | Avg of Binary-CLS tasks (MR, CR, SUBJ, MPQA, SST) | MRPC

Our multi-view model with f and g:
f      8     67.4           83.0          86.6          75.5/82.7
g            69.2           82.6          85.2          74.3/82.7
ens.         70.6           83.0          86.6          76.8/83.9

Multi-view with f and f:
f      17    49.7 (↓17.7)   82.2 (↓0.8)   86.3 (↓0.3)   75.9/83.0
ens.         57.3 (↓13.3)   81.9 (↓1.1)   87.1 (↑0.5)   77.2/83.7
Multi-view with g and g:
g      2     68.5 (↓0.7)    80.8 (↓1.8)   84.2 (↓1.0)   72.5/82.0
ens.         69.1 (↓1.5)    77.0 (↓6.0)   84.5 (↓2.1)   73.5/82.3
Ensemble of the two variants above:
ens.   19    67.5 (↓3.1)    82.3 (↓0.7)   86.9 (↑0.3)   76.6/83.8

Single-view with f only:
f      9     57.8 (↓9.6)    81.6 (↓1.4)   85.8 (↓0.8)   74.8/82.3
Single-view with g only:
g      1.5   68.7 (↓0.5)    81.1 (↓1.5)   83.3 (↓1.9)   72.9/81.0
Ensemble of the two single-view models:
ens.   10.5  68.6 (↓2.0)    82.3 (↓0.7)   86.3 (↓0.3)   75.4/82.5

Multi-view with f and g, trained with Eq. 3:
f      8     48.3 (↓19.1)   79.9 (↓3.1)   85.4 (↓1.2)   74.9/83.1
g            68.8 (↓0.4)    81.9 (↓0.7)   84.0 (↓1.2)   73.6/81.4
ens.         65.7 (↓4.9)    82.3 (↓0.7)   86.3 (↓0.3)   75.9/83.3
Multi-view with f and g, trained with Eq. 4:
f      8     59.9 (↓7.5)    80.5 (↓2.5)   85.2 (↓1.4)   74.5/82.2
g            68.5 (↓0.7)    80.2 (↓2.4)   83.0 (↓2.2)   68.5/80.7
ens.         67.5 (↓3.1)    82.0 (↓1.0)   86.2 (↓0.4)   75.0/82.4

Table 6: Results of our multi-view model with f and g, and of the other variants, trained on the UMBC News Corpus. "Avg of STS tasks" refers to the mean Pearson's r on the five STS tasks; "Avg of SICK-R, STS-B" refers to the mean Pearson's r on SICK-Relatedness and STS-Benchmark, as both require the same feature engineering methods proposed in Tai2015ImprovedSR ; "Avg of Binary-CLS tasks" refers to the mean accuracy on the five sentiment analysis tasks; "ens." stands for an ensemble of the two representations. The arrows indicate the performance boost (↑) or drop (↓) relative to the same part of our model, e.g., "↓17.7" indicates that the performance of f in the multi-view model with two f-s is 17.7 points lower than that of f in our multi-view model with f and g.

In general, an ensemble of the representations generated from two distinct encoding functions performs even better. The two encoding functions, f and g, have naturally different behaviour. Although f, which is an RNN, is able to approximate the linear function g, the distributional similarity firth57synopsis , which implies that spatially adjacent sentences should be mapped to close vectors, helps the two functions learn more generalised representations. (Without the distributional similarity, the model learns a trivial solution that matches the representations produced from f and g for the same input sentence, since f is powerful enough to approximate g; in this case, both f and g collapse, and no useful feature is learnt.) Therefore, f and g encode the input sentence with emphasis on different aspects, and the linear model that is subsequently trained for each of the supervised downstream tasks benefits from this diversity, leading to better predictions.

Compared with an ensemble of two multi-view models, each with two encoding functions of the same type, our multi-view model with f and g provides slightly better results on the STS tasks and similar results on the supervised evaluation tasks, while having much higher training efficiency. Compared with an ensemble of two single-view models, each with only one encoding function, the matching between f and g in our multi-view model produces better results.

5.2 Symmetric Agreement Between Two Views

Many choices are plausible for calculating the agreement between the two distinctive views in our proposed learning framework, thus it is important to empirically compare a few reasonable ones. Besides the one used in our learning framework, two other symmetric agreement functions were tested. The definitions of the three agreement functions are listed as Eq. 2, 3 and 4, where $\bar{v}$ denotes the post-processed and normalised vector, and $\hat{v}$ denotes the post-processed vector:

$$a_{ij} = \bar{v}_i^{f\top}\bar{v}_j^{g} + \bar{v}_i^{g\top}\bar{v}_j^{f} \qquad (2)$$

$$a_{ij} = \cos\!\left(\hat{v}_i^{f} + \hat{v}_i^{g},\; \hat{v}_j^{f} + \hat{v}_j^{g}\right) \qquad (3)$$

$$a_{ij} = \cos\!\left([\hat{v}_i^{f}; \hat{v}_i^{g}],\; [\hat{v}_j^{f}; \hat{v}_j^{g}]\right) \qquad (4)$$

The results are presented in the last two sections of Table 6.

We found that the model with the agreement in Eq. 2, which is used in all of our other experiments, outperforms those with the other two agreement functions. Our explanation is that both Eq. 3 and Eq. 4 involve maximising the agreement among representations from a single view, and since representations produced by the same function, either f or g, tend to have a similar structure, it is easier to optimise each of the two views to match itself (on neighbouring sentences) instead of the other one, which conflicts with the goal of multi-view learning (see Figure 1).
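To see where the single-view terms come from, note that, under the forms of Eq. 3 and Eq. 4 written above, the numerator of the cosine decomposes into view-wise inner products; for Eq. 3,

\[
\langle \hat{v}_i^{f} + \hat{v}_i^{g},\, \hat{v}_j^{f} + \hat{v}_j^{g} \rangle
= \langle \hat{v}_i^{f}, \hat{v}_j^{f} \rangle
+ \langle \hat{v}_i^{g}, \hat{v}_j^{g} \rangle
+ \langle \hat{v}_i^{f}, \hat{v}_j^{g} \rangle
+ \langle \hat{v}_i^{g}, \hat{v}_j^{f} \rangle,
\]

while for the concatenation in Eq. 4 only the first two (single-view) terms survive. The f-f and g-g terms can therefore carry the agreement while the cross-view terms stay small, which is the behaviour plotted in Figure 1(b).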

Figure 1: Mean cosine similarity of adjacent sentences, divided by the temperature, vs. number of training iterations (in units of 10k). (a) Trained with our Eq. 2. (b) Trained with Eq. 3. "f-g" refers to the agreement between the representations from f and g, and similarly for "f-f" and "g-g". During training with Eq. 3, f-f and g-g gradually dominate the agreement, so f-g gets down-weighted, while with our Eq. 2, f and g are constantly encouraged to learn from each other.

6 Conclusion

We proposed a unified multi-view sentence representation learning framework that combines an RNN-based encoder and an average-on-word-vectors linear encoder, and that can be trained efficiently within a few hours on a large unlabelled corpus. The experiments were conducted on three large unlabelled corpora, and meaningful comparisons were made to demonstrate the generalisation ability and transferability of our learning framework and to consolidate our claims. The produced sentence representations outperform existing unsupervised transfer methods on unsupervised evaluation tasks, and match the performance of the best unsupervised model on supervised evaluation tasks.

As presented in our experiments, the ensemble of the two views leverages the advantages of both, providing rich semantic information about the input sentence, and the multi-view training helps each view produce better representations than single-view training does. Meanwhile, our experimental results also support the finding in Hill2016LearningDR that linear/log-linear models (g in our model) tend to work better on the unsupervised tasks, while RNN-based models (f in our model) generally perform better on the supervised tasks. Future work should explore relaxing the cosine similarity metric to incorporate the length information of the produced sentence representations.

Our multi-view learning framework was inspired by the asymmetric information processing of the two hemispheres of the human brain, in which, for most adults, the left hemisphere contributes sequential processing, including most language understanding, and the right one carries out more parallel processing, including visual-spatial understanding bryden2012laterality . Our experimental results raise an intriguing hypothesis about how these two types of information processing may complement each other in learning.

Acknowledgements

We appreciate the gift funding from Adobe Research. Many thanks to Sam Bowman for the thoughtful discussion, and to Mengting Wan, Wangcheng Kang, and Jianmo Ni for critical comments on the project.

References

  • (1) E. Agirre, C. Banea, C. Cardie, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In SemEval@NAACL-HLT, 2015.
  • (2) E. Agirre, C. Banea, C. Cardie, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe. Semeval-2014 task 10: Multilingual semantic textual similarity. In SemEval@COLING, 2014.
  • (3) E. Agirre, C. Banea, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In SemEval@NAACL-HLT, 2016.
  • (4) E. Agirre, D. M. Cer, M. T. Diab, and A. Gonzalez-Agirre. Semeval-2012 task 6: A pilot on semantic textual similarity. In SemEval@NAACL-HLT, 2012.
  • (5) E. Agirre, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, and W. Guo. *sem 2013 shared task: Semantic textual similarity. In *SEM@NAACL-HLT, 2013.
  • (6) S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski. A latent variable model approach to pmi-based word embeddings. TACL, 4:385–399, 2016.
  • (7) S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations, 2017.
  • (8) J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
  • (9) P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. TACL, 5:135–146, 2017.
  • (10) S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
  • (11) M. Bryden. Laterality functional asymmetry in the intact brain. Elsevier, 2012.
  • (12) J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • (13) A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.
  • (14) S. Coulson, K. D. Federmeier, C. van Petten, and M. Kutas. Right hemisphere sensitivity to word- and sentence-level context: evidence from event-related brain potentials. Journal of experimental psychology. Learning, memory, and cognition, 31 1:129–47, 2005.
  • (15) S. Coulson and C. van Petten. A special role for the right hemisphere in metaphor comprehension? erp evidence from hemifield presentation. Brain research, 1146:128–45, 2007.
  • (16) V. R. de Sa. Learning classification with unlabeled data. In NIPS, pages 112–119, 1993.
  • (17) W. B. Dolan, C. Quirk, and C. Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In COLING, 2004.
  • (18) J. R. Firth. A synopsis of linguistic theory. 1957.
  • (19) Z. Gan, Y. Pu, R. Henao, C. Li, X. He, and L. Carin. Learning generic sentence representations using convolutional neural networks. In EMNLP, 2017.
  • (20) J. Ganitkevitch, B. V. Durme, and C. Callison-Burch. Ppdb: The paraphrase database. In HLT-NAACL, 2013.
  • (21) L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese. Umbc_ebiquity-core: semantic textual similarity systems. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, pages 44–52, 2013.
  • (22) Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
  • (23) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
  • (24) F. Hill, K. Cho, and A. Korhonen. Learning distributed representations of sentences from unlabelled data. In HLT-NAACL, 2016.
  • (25) M. Hu and B. Liu. Mining and summarizing customer reviews. In KDD, 2004.
  • (26) Y. Jernite, S. R. Bowman, and D. Sontag. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557, 2017.
  • (27) T. Kenter, A. Borisov, and M. de Rijke. Siamese cbow: Optimizing word embeddings for sentence representations. In ACL, 2016.
  • (28) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (29) J. R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, 2015.
  • (30) Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
  • (31) O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015.
  • (32) J. Li and E. H. Hovy. A model of coherence based on distributed sentence representation. In EMNLP, 2014.
  • (33) X. Li and D. Roth. Learning question classifiers. In COLING, 2002.
  • (34) L. Logeswaran and H. Lee. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018.
  • (35) M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli. A sick cure for the evaluation of compositional distributional semantic models. In LREC, 2014.
  • (36) J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, 2015.
  • (37) B. McCann, J. Bradbury, C. Xiong, and R. Socher. Learned in translation: Contextualized word vectors. In NIPS, 2017.
  • (38) T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. Advances in pre-training distributed word representations. CoRR, abs/1712.09405, 2017.
  • (39) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • (40) R. Mises and H. Pollaczek-Geiringer. Praktische verfahren der gleichungsauflösung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, 9(1):58–77, 1929.
  • (41) J. Mu, S. Bhat, and P. Viswanath. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations, 2018.
  • (42) A. Nie, E. D. Bennett, and N. D. Goodman. Dissent: Sentence representation learning from explicit discourse relations. CoRR, abs/1710.04334, 2017.
  • (43) B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, 2004.
  • (44) B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005.
  • (45) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
  • (46) J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • (47) R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
  • (48) K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL, 2015.
  • (49) P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res., 37:141–188, 2010.
  • (50) J. Wiebe, T. Wilson, and C. Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39:165–210, 2005.
  • (51) J. Wieting and K. Gimpel. Revisiting recurrent networks for paraphrastic sentence embeddings. In ACL, 2017.
  • (52) A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. CoRR, abs/1704.05426, 2017.
  • (53) Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy. Hierarchical attention networks for document classification. In HLT-NAACL, 2016.
  • (54) H. Zhao, Z. Lu, and P. Poupart. Self-adaptive hierarchical sentence model. In IJCAI, 2015.
  • (55) Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. ICCV, pages 19–27, 2015.

1 Power Iteration

The Power Iteration method was proposed in mises1929praktische , and it is an efficient algorithm for estimating the top eigenvector of a given covariance matrix. Here, it is used to estimate the top principal component of the representations produced from f and g separately. We omit the superscript here, since the same step is applied to both f and g.

Suppose there is a batch of N representations, stacked as a matrix V, from either f or g; the Power Iteration method is applied to estimate the top eigenvector of the covariance matrix C of V (in practice, N is usually less than the dimensionality of the representations, so the top eigenvector can equivalently be estimated from the smaller N×N matrix), as described in Algorithm 1:

1: Input: covariance matrix C, number of iterations T
2: Output: first principal component u
3: Initialise u as a unit-length vector
4: for t = 1, ..., T do
5:     u ← C u, u ← u / ‖u‖
6: end for
Algorithm 1: Estimating the First Principal Component (Power Iteration mises1929praktische )
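For reference, a PyTorch sketch of Algorithm 1 together with the removal step from Section 3.2 of the main paper; the iteration count T and the (N × d) data layout are assumptions for illustration:

```python
import torch

def first_pc(V, T=10):
    """Estimate the first principal component of V (N x d) by Power Iteration."""
    C = V.t() @ V                  # d x d covariance matrix (up to a constant)
    u = torch.randn(V.size(1))
    u = u / u.norm()               # initialise a unit-length vector
    for _ in range(T):
        u = C @ u                  # u <- C u
        u = u / u.norm()           # renormalise to unit length
    return u

def remove_first_pc(V, u):
    """Remove the component: v <- v - (u^T v) u, applied to each row of V."""
    return V - (V @ u).unsqueeze(1) * u
```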

2 Detailed Results on STS tasks

Each year's STS task has multiple datasets, so a detailed comparison on every dataset is helpful for understanding the behaviour of our model and of the related work. We present the detailed results on all datasets in Table 1, and since FastSent Hill2016LearningDR and QT logeswaran2018an only reported their performance on STS14, we use a separate Table 2 to compare their models with ours.

We tuned the hyperparameters of our model trained on the BookCorpus Zhu2015AligningBA on the averaged Pearson's r on STS14, and it is clear that our model performs better than others on STS14 on average. Although there are some dataset overlaps among STS12, 13, 14, 15 and 16, our model still works better than other models on the datasets that do not overlap with STS14, which demonstrates solid generalisation ability and transferability.

Dataset | Multi-view (C1, C2, C3) | GloVe (avg, tfidf, WR, proc.) | word2vec (bow, proc.) | ST | S-cbow | InferSent | GRAN | LSTM-avg
(Un. Transfer: Arora2017ASB ; Mu2017AllbuttheTopSA ; Kenter2016SiameseCO . Su. Transfer: Conneau2017SupervisedLO ; Wieting2017RevisitingRN .)

MSRpar 40.3 43.0 40.1 47.7 50.3 35.6 44.1 42.1 43.9 16.8 43.8 40.0 47.7 49.0
MSRvid 85.4 87.8 84.8 63.9 77.9 83.8 68.1 72.1 72.2 41.7 45.2 82.8 85.2 84.3
SMTeuro 51.2 54.2 51.1 46.0 54.7 49.9 45.3 53.2 54.3 35.2 45.0 49.6 49.3 51.2
OnWN 74.2 74.8 72.8 55.1 64.7 66.2 65.7 69.4 69.5 29.7 64.4 59.6 71.5 71.5
SMTnews 53.3 60.3 54.5 59.6 45.7 45.6 47.2 49.4 48.5 30.8 39.0 59.3 58.7 68.0
STS'12 60.9 64.0 46.8 52.5 58.7 56.2 54.1 57.2 57.7 30.8 47.5 58.2 62.5 64.8
FNWN 46.3 47.9 46.8 34.2 36.6 39.4 39.3 40.7 42.0 30.4 23.2 26.3 55.6 53.2
headlines 69.9 74.4 73.4 63.8 63.7 64.7 57.2 61.9 63.8 34.6 65.3 66.4 76.1 77.3
OnWN 83.4 82.9 80.2 49.0 75.2 82.8 58.6 67.9 68.2 10.0 49.9 69.2 81.4 81.2
SMT 40.8 41.5 39.3 22.3 29.6 37.9 - - - 24.3 33.1 32.0 40.3 40.7
STS'13 60.1 61.7 59.9 42.3 52.1 56.6 57.7 56.8 58.0 24.8 42.9 48.5 63.4 63.1
deft-forum 51.0 51.0 44.3 27.1 37.5 41.2 29.4 32.2 33.3 12.9 40.8 42.4 55.7 56.6
deft-news 67.6 73.3 72.0 68.0 68.7 69.4 71.5 66.8 66.0 23.5 59.1 73.3 77.1 78.0
headlines 66.8 71.8 68.4 59.5 63.7 64.7 52.6 58.0 59.6 37.8 63.6 61.7 72.8 74.5
images 83.1 86.2 84.1 61.0 72.5 82.6 68.3 73.8 74.2 51.2 65.0 78.5 85.8 84.7
OnWN 84.2 84.1 81.7 58.4 75.2 82.8 67.6 74.6 74.8 23.3 60.7 76.5 85.1 84.9
tweet-news 76.1 75.8 73.4 51.2 65.1 70.1 66.1 71.9 72.1 39.9 75.2 70.0 78.7 76.3
STS'14 71.5 73.4 70.7 54.2 63.8 68.5 59.2 62.9 63.3 31.4 60.4 67.1 75.9 75.8
answers-forums 72.0 72.6 72.7 30.5 45.6 63.9 39.9 46.4 46.8 36.1 21.8 60.5 73.1 71.8
answers-students 74.3 71.0 74.7 63.0 63.9 70.4 62.4 68.1 68.0 33.0 36.7 68.0 72.9 71.1
belief 79.0 77.9 75.9 40.5 49.5 71.8 57.7 59.7 60.4 24.6 47.7 71.5 78.0 75.3
headlines 72.7 77.9 75.7 61.8 70.9 70.7 53.3 61.5 63.5 43.6 21.5 70.4 78.6 79.5
images 84.3 86.4 83.8 67.5 72.9 81.5 73.2 78.1 78.1 17.7 25.6 85.0 85.8 85.8
STS'15 76.4 77.2 76.5 52.7 60.6 71.7 57.3 62.7 63.4 31.0 30.7 71.1 77.9 76.7
answer-answer 68.7 65.1 64.3 - - - - - - - - 61.1 - -
headlines 71.7 75.0 73.4 - - - - - - - - 68.6 - -
plagiarism 84.4 84.8 83.7 - - - - - - - - 80.5 - -
postediting 85.3 84.3 85.9 - - - - - - - - 81.9 - -
question-question 68.9 74.1 66.4 - - - - - - - - 64.0 - -
STS'16 75.8 76.7 74.8 - - - - - - - - 71.2 - -
SICK'14 74.7 74.9 72.8 69.4 69.4 72.2 67.9 70.1 61.5 49.8 - 73.4 72.9 71.3

Table 1: Results on unsupervised evaluation tasks (Pearson's r). Bold numbers are the best results among unsupervised transfer models, and underlined numbers are the best among all models.
Datasets FastSent Hill2016LearningDR QT logeswaran2018an Multi-view
+AE RNN BOW C1 C2 C3
deft-forum 41.0 41.0 15.0 37.0 51.0 51.0 44.3
deft-news 58.0 56.0 48.0 62.0 67.6 73.3 72.0
deft-headlines 57.0 58.0 48.0 60.0 66.8 71.8 68.4
images 74.0 63.0 53.0 76.0 83.1 86.2 84.1
OnWN 74.7 69.0 53.0 76.0 84.2 84.1 81.7
tweet-news 63.0 70.0 62.0 67.0 76.1 75.8 73.4
STS14 61.2 59.5 49.0 65.0 71.5 73.4 70.7
Table 2: Comparison with FastSent and QT on every dataset in STS14 (Pearson's r).


BookCorpus (C1) | Hrs | Avg of STS tasks (STS12-16) | Avg of SICK-R, STS-B | Avg of Binary-CLS tasks (MR, CR, SUBJ, MPQA, SST) | MRPC

Our multi-view model with f and g:
f      3     64.4           81.3          86.5          75.1/82.7
g            68.7           82.6          85.2          74.0/81.5
ens.         68.9           82.6          87.0          77.1/83.4

Multi-view with f and f:
f      6     62.3 (↓2.1)    81.6 (↑0.3)   86.3 (↓0.2)   75.7/83.7
ens.         63.1 (↓5.8)    81.9 (↓0.7)   87.1 (↑0.1)   76.7/83.3
Multi-view with g and g:
g      1     68.2 (↓0.5)    81.8 (↓0.8)   85.0 (↓0.2)   73.7/81.7
ens.         69.0 (↑0.1)    78.0 (↓4.6)   85.3 (↓1.7)   73.8/82.6
Ensemble of the two variants above:
ens.   7     68.6 (↓0.3)    82.6 (±0)     87.0 (±0)     76.5/84.2

Single-view with f only:
f      2.5   60.9 (↓3.5)    81.5 (↑0.2)   86.2 (↓0.3)   74.7/82.6
Single-view with g only:
g      1     68.7 (±0)      82.3 (↓0.3)   85.1 (↓0.1)   72.8/81.8
Ensemble of the two single-view models:
ens.   3.5   67.2 (↓1.7)    82.7 (↑0.1)   87.1 (↑0.1)   76.4/83.2

Multi-view with f and g, trained with Eq. 3 (main paper):
f      3     61.3 (↓3.1)    80.6 (↓0.7)   86.5 (±0)     74.8/81.9
g            68.2 (↓0.5)    82.6 (±0)     84.8 (↓0.4)   74.7/82.2
ens.         64.5 (↓4.4)    82.6 (±0)     86.9 (↓0.1)   75.8/83.0
Multi-view with f and g, trained with Eq. 4 (main paper):
f      3     52.9 (↓11.5)   77.7 (↓3.6)   85.2 (↓1.3)   74.7/82.2
g            67.8 (↓0.9)    81.8 (↓0.8)   84.1 (↓1.1)   72.7/81.9
ens.         64.6 (↓4.3)    82.6 (±0)     87.0 (±0)     77.1/83.4

Table 3: Results of our multi-view model with f and g, and of the other variants, trained on BookCorpus. "Avg of STS tasks" refers to the mean Pearson's r on the five STS tasks; "Avg of SICK-R, STS-B" refers to the mean Pearson's r on SICK-Relatedness and STS-Benchmark, as both require the same feature engineering methods proposed in Tai2015ImprovedSR ; "Avg of Binary-CLS tasks" refers to the mean accuracy on the five sentiment analysis tasks; "MRPC" refers to the Microsoft Research Paraphrase Corpus task, reported as Accuracy/F1; "ens." stands for an ensemble of the two representations. The arrows indicate a performance boost (↑) or drop (↓) relative to the same part of our model, and "(±0)" indicates no change.

3 Multi-view Learning vs. Single-view Learning

In order to show that multi-view learning with f and g helps the learning, we compare our model with other variants, including multi-view models with two functions of the same type but parameterised independently (either two f-s or two g-s), and single-view models with only one f or one g. The results of models trained on BookCorpus with different settings are presented in Table 3.

The results also support our claim that multi-view learning with two different views improves each view relative to single-view learning, and that it performs better than multi-view models with the same architecture parameterised separately.

Generally, an ensemble produces better results on the supervised evaluation tasks. However, only in our multi-view learning framework with two distinctive encoders does an ensemble of the two representations also provide better performance on the STS tasks; in the other variants, the ensemble of the two representations is inferior to the linear encoding function on its own.

4 Training & Model Details

The hyperparameters we need to tune include the batch size N, the dimension of the GRU encoder d, and the context window c; the results presented in this paper are based on a single tuned setting of these three values. Training takes up to 8GB of memory on a GTX 1080Ti GPU.

The initial learning rate was not annealed during training. All weights in the model are initialised using the method proposed in He2015DelvingDI , all gates in the bi-GRU are initialised to 1, and all biases in the single-layer neural network are zeroed before training. The word vectors are fixed to the FastText vectors Bojanowski2017EnrichingWV , and we do not finetune them. Words that are not in FastText's vocabulary are fixed to zero vectors throughout training. The temperature term τ is initialised before training and tuned by gradient descent during training.

The temperature term τ is used to convert the agreement a_{ij} into a probability distribution in Eq. 1 of the main paper. In our experiments, τ is a trainable parameter whose value decreased consistently through training. Another model, trained with τ fixed to the final value, performed similarly.

5 Number of Parameters

The number of parameters of each of the selected models is:

  1. Ours:

  2. Quick-thought logeswaran2018an :

  3. Skip-thought Kiros2015SkipThoughtV :