Exploiting Invertible Decoders for Unsupervised Sentence Representation Learning

09/08/2018 · Shuai Tang et al. · University of California, San Diego

Encoder-decoder models for unsupervised sentence representation learning tend to discard the decoder after training on a large unlabelled corpus, since only the encoder is needed to map the input sentence into a vector representation. However, the parameters learnt in the decoder also contain useful information about language. In order to utilise the decoder after learning, we present two types of decoding functions whose inverses can be derived easily, without expensive inverse calculation. The inverse of the decoding function therefore serves as another encoder that produces sentence representations. We show that, with careful design of the decoding functions, the model learns good sentence representations, and the ensemble of the representations produced by the encoder and by the inverse of the decoder demonstrates even better generalisation ability and solid transferability.


1 Introduction

Learning sentence representations from unlabelled data is becoming increasingly prevalent in both the machine learning and natural language processing research communities, as it efficiently and cheaply allows knowledge extraction that can successfully transfer to downstream tasks. Methods built upon the distributional hypothesis Harris (1954) and distributional similarity Firth (1957) can be roughly categorised into two types:

Generative objective: These models generally follow the encoder-decoder structure and learn to encode the current sentence and decode the ones in the adjacent context Kiros et al. (2015); Gan et al. (2017); Tang et al. (2018). As the focus is on learning representations, the quality of generated sequences is not the main concern, thus the decoder is usually discarded after learning.

Discriminative objective: A classifier is learnt on top of the encoders to distinguish adjacent sentences from those that are not Li and Hovy (2014); Jernite et al. (2017); Nie et al. (2017); Logeswaran and Lee (2018). These models make a prediction using a predefined differentiable similarity function on the representations of the input sentence pairs or triplets. The similarity function, and the effective negative samples, which are pairs of non-adjacent sentences, crucially determine the quality of the learnt representations.

Our goal is to exploit invertible decoding functions, which can then be used as additional encoders during testing. The contribution of our work is summarised as follows:

  1. The decoder is used in testing to produce representations of sentences. With careful design, the inverse function of the decoder is easy to derive with no expensive inverse calculation.

  2. The inverse function of the decoder naturally behaves differently from the encoder; thus the representations from both functions complement each other and an ensemble of both provides good results on downstream tasks.

2 Related Work

Learning vector representations for words with a word embedding matrix as the encoder and a context word embedding matrix as the decoder Mikolov et al. (2013a); Lebret and Collobert (2014); Pennington et al. (2014); Bojanowski et al. (2017) can be considered as a word-level example of our approach, since these models learn to predict the surrounding words in the context given the current word. The context word embeddings can also be utilised to augment the word embeddings Pennington et al. (2014); Levy et al. (2015). We are thus motivated to explore the use of sentence decoders after learning instead of ignoring them, as most sentence encoder-decoder models do.

Our approach is to invert the decoding function in order to use it as another encoder to assist the original encoder. In order to make computation of the inverse function well-posed and tractable, careful design of the decoder is needed. A simple instance of an invertible decoder is a linear projection with an orthonormal square matrix, whose transpose is its inverse. A family of bijective transformations with non-linear functions Dinh et al. (2014); Rezende and Mohamed (2015); Kingma et al. (2016) can also be considered as it empowers the decoder to learn a complex data distribution.

In our paper, we exploit two types of plausible decoding functions, including linear projection and bijective functions with neural networks Dinh et al. (2014), and with proper design, the inverse of each of the decoding functions can be derived without expensive calculation after learning. Thus, the decoding function can be utilised along with the encoder for building sentence representations. We show that the ensemble of the encoder and the inverse of the decoder outperforms each of them.

3 Model Design

Our model has similar structure to that of skip-thought Kiros et al. (2015) and, given the neighbourhood hypothesis Tang et al. (2017), learns to decode the next sentence given the current one.

3.1 Training Objective

Given the finding Tang et al. (2018) that neither an autoregressive nor an RNN decoder is necessary for learning sentence representations that excel on downstream tasks, our model only learns to predict words in the next sentence. Suppose that the $i$-th sentence $s_i$ has $N_i$ words and the next sentence $s_{i+1}$ has $N_{i+1}$ words. The training objective is to maximise the averaged log-likelihood over all sentence pairs:

$$\ell_{s_{i+1}|s_i}(\theta,\phi) = \frac{1}{N_{i+1}}\sum_{w_j \in s_{i+1}} \log P(w_j \mid s_i; \theta, \phi)$$

where $\theta$ and $\phi$ contain the parameters in the encoder $f_{en}(\cdot;\theta)$ and the decoder $f_{de}(\cdot;\phi)$ respectively. Since calculating the probability of decoding each word involves a computationally demanding softmax function, the negative sampling method Mikolov et al. (2013a) is applied to replace the softmax, and $\log P(w_j \mid s_i)$ is calculated as:

$$\log P(w_j \mid s_i) = \log\sigma\!\left(u_{w_j}^\top f_{de}(z_i)\right) + \sum_{k=1}^{K}\mathbb{E}_{w_k \sim P_e(w)}\!\left[\log\sigma\!\left(-u_{w_k}^\top f_{de}(z_i)\right)\right]$$

where $u_{w_j}$ is the pretrained vector representation from FastText Bojanowski et al. (2017) for $w_j$, $f_{de}(z_i)$ is the output of the decoder given the representation $z_i$ of the $i$-th sentence, and the empirical distribution $P_e(w)$ is the unigram distribution raised to the power 0.75 Mikolov et al. (2013b). For simplicity, we omit the subscript $i$ for indexing the sentences in the following sections.
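To make the objective concrete, here is a minimal PyTorch sketch of the negative-sampling loss above; the function name, tensor layout, and batching are our own illustration rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(decoded, target_vecs, noise_vecs):
    """Negative-sampling word-prediction loss (to be minimised).

    decoded:     (batch, dim)        output of the decoder, f_de(z)
    target_vecs: (batch, n_pos, dim) fixed FastText vectors of words in the next sentence
    noise_vecs:  (batch, n_neg, dim) fixed FastText vectors of sampled noise words
    """
    # u_w^T f_de(z) for every positive word in the next sentence
    pos_scores = torch.einsum('bd,bnd->bn', decoded, target_vecs)
    # -u_w^T f_de(z) for every sampled noise word
    neg_scores = -torch.einsum('bd,bnd->bn', decoded, noise_vecs)
    # maximising log sigma(pos) + sum_k log sigma(-neg) == minimising its negation
    log_likelihood = (F.logsigmoid(pos_scores).mean(dim=1)
                      + F.logsigmoid(neg_scores).sum(dim=1))
    return -log_likelihood.mean()
```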

3.2 Encoder

The encoder is a bi-directional Gated Recurrent Unit Chung et al. (2014) with 1200 dimensions in each direction. It processes the word vectors in an input sentence one at a time and generates a sequence of hidden states. During training, only the last hidden state serves as the sentence representation $z$.
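A minimal PyTorch sketch of such an encoder is given below; the class name and the exact way the two directional states are combined are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bi-directional GRU; the last hidden state of each direction is
    concatenated to form the sentence representation z."""

    def __init__(self, word_dim=300, hidden_dim=1200):
        super().__init__()
        self.gru = nn.GRU(word_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, word_vectors):             # (batch, seq_len, word_dim)
        states, h_n = self.gru(word_vectors)     # h_n: (2, batch, hidden_dim)
        z = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
        return states, z                         # all hidden states and the representation
```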

3.3 Decoder

The goal is to reuse the decoding function for building sentence representations after learning rather than ignoring it, thus one possible solution is to find the inverse of the decoding function, which is denoted as $f_{de}^{-1}$. In order to reduce the complexity and the running time during both training and testing, the decoding function needs to be easily invertible. Here, two types of decoding functions are considered.

References for the compared models: Bojanowski et al. (2017); Arora et al. (2017); Mu et al. (2018); Conneau et al. (2017); Wieting and Gimpel (2018). Evaluation tasks: Agirre et al. (2012); Agirre et al. (2013); Agirre et al. (2014); Agirre et al. (2015); Agirre et al. (2016); Marelli et al. (2014). Column groups are marked by training regime: Un. Transfer (unsupervised transfer), Semi. (semi-supervised), Su. (supervised).

Multi-view fastText GloVe word2vec PSL Infer ParaNMT
C1 C2 C3 avg WR tfidf WR proc. bow proc. avg WR Sent (concat.)
STS12 60.0 61.3 60.1 58.3 58.8 58.7 56.2 54.1 57.2 57.7 52.8 59.5 58.2 67.7
STS13 60.5 61.8 60.2 51.0 59.9 52.1 56.6 57.7 56.8 58.0 46.4 61.8 48.5 62.8
STS14 71.1 72.1 71.5 65.2 69.4 63.8 68.5 59.2 62.9 63.3 59.5 73.5 67.1 76.9
STS15 75.7 76.9 75.5 67.7 74.2 60.6 71.7 57.3 62.7 63.4 60.0 76.3 71.1 79.8
STS16 75.4 76.1 75.1 64.3 72.4 - - - - - - - 71.2 76.8
SICK14 73.8 73.6 72.7 69.8 72.3 69.4 72.2 67.9 70.1 61.5 66.4 72.9 73.4 -
Table 1: Results on unsupervised evaluation tasks. Bold numbers are the best results among unsupervised transfer models, and underlined numbers are the best ones among all models. Our approach with an invertible linear decoder demonstrates stronger transferability than other unsupervised transfer methods.

3.3.1 Linear Projection

The simplest decoding function is a linear projection, $x = f_{de}(z) = Wz$, where $W \in \mathbb{R}^{d_x \times d_z}$ is a trainable weight matrix and usually $d_z > d_x$, since the sentence representation has higher dimensionality than a word vector. Thus, a right inverse of $W$ exists when $W$ is full-rank, and the inverse function is:

$$f_{de}^{-1}(x) = W^\top (W W^\top)^{-1} x$$

Since $f_{de}$ is a surjective function, the inverse given by the pseudoinverse of $W$ is not well-defined without use of another constraint. Here, a row-wise orthonormal regularisation on $W$ is applied during training, which leads to $W W^\top = I$, where $I$ is the $d_x \times d_x$ identity matrix; thus the inverse function is simply $f_{de}^{-1}(x) = W^\top x$, which is easily computed. The regularisation formula is $\|W W^\top - I\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm. Specifically, the update rule Cissé et al. (2017) for the regularisation is:

$$W \leftarrow (1+\beta)\,W - \beta\, W W^\top W$$

where $\beta$ is set to a small constant. After learning, all 300 singular values of $W$ are very close to 1.
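Below is a sketch of such a decoder in PyTorch; the dimensions, the value of beta, and the method names are illustrative assumptions, and the secondary orthonormal update would be applied after each optimiser step.

```python
import torch
import torch.nn as nn

class InvertibleLinearDecoder(nn.Module):
    """Linear decoder x = W z; with W W^T kept close to I, its inverse is
    approximated by the transpose, f_de^{-1}(x) = W^T x."""

    def __init__(self, z_dim=2400, x_dim=300, beta=0.01):  # beta: illustrative value
        super().__init__()
        self.W = nn.Parameter(torch.empty(x_dim, z_dim))
        nn.init.orthogonal_(self.W)   # start with orthonormal rows
        self.beta = beta

    def forward(self, z):             # decode: sentence space -> word space
        return z @ self.W.t()

    def inverse(self, x):             # encode: word space -> sentence space
        return x @ self.W

    @torch.no_grad()
    def orthonormal_update(self):
        # secondary update rule of Cissé et al. (2017): W <- (1+b) W - b W W^T W
        W = self.W
        W.copy_((1 + self.beta) * W - self.beta * (W @ W.t()) @ W)
```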

3.3.2 Bijective Functions

A general case is to use a bijective function as the decoder, as bijective functions are naturally invertible. A family of bijective transformations was designed in NICE Dinh et al. (2014), and the simplest continuous bijective function, the additive coupling layer, and its inverse are defined as:

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} + m(x_{1:d})$$
$$x_{1:d} = y_{1:d}, \qquad x_{d+1:D} = y_{d+1:D} - m(y_{1:d})$$

where $x_{1:d}$ is a $d$-dimensional partition of the input $x \in \mathbb{R}^D$, and $m$ is an arbitrary continuous function, which could be a trainable multi-layer feedforward neural network with non-linear activation functions.

The requirement of the continuous bijective transformation is that the dimensionality of the input and the output need to match exactly, while in our case the output of the decoding function has lower dimensionality than the input. Our solution is to add an orthonormal regularised linear projection on top of the bijective function to transform the output to the desired dimension (denoted as "Bijection+Linear" in the result tables).
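A sketch of one such additive coupling layer is given below; the hidden width of m and the class interface are assumptions, and in our model an orthonormal-regularised linear projection would follow it.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """NICE-style additive coupling: the first d dimensions pass through
    unchanged, the remaining ones are shifted by m(x[:, :d])."""

    def __init__(self, dim=2400, d=1200, hidden=1200):
        super().__init__()
        self.d = d
        self.m = nn.Sequential(                 # arbitrary continuous function m
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, dim - d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        return torch.cat([x1, x2 + self.m(x1)], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        return torch.cat([y1, y2 - self.m(y1)], dim=1)
```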

3.4 Using Decoder in the Test Phase

As the decoder is easily invertible, it is also used to produce vector representations. The post-processing step Arora et al. (2017) that removes the top principal component is applied to the representations from $f_{en}$ and $f_{de}^{-1}$ individually. In the following sections, $z_{en}$ denotes the post-processed representation from $f_{en}$, and $z_{de}$ the one from $f_{de}^{-1}$. Since $f_{en}$ and $f_{de}^{-1}$ naturally process sentences in distinctive ways, it is reasonable to expect that the ensemble of $z_{en}$ and $z_{de}$ will outperform each of them.
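A minimal sketch of this post-processing step on a matrix of sentence representations (one row per sentence); whether to centre before extracting the component is an implementation detail we do not assert here.

```python
import numpy as np

def remove_top_pc(embeddings):
    """Remove each representation's projection onto the top principal
    component of the embedding matrix (in the spirit of Arora et al., 2017)."""
    # first right-singular vector of the (n_sentences, dim) matrix
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    u = vt[0]                                        # (dim,)
    return embeddings - np.outer(embeddings @ u, u)
```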

4 Experimental Design

The experiments are conducted in PyTorch Paszke et al. (2017), and the evaluation is done using the SentEval package Conneau et al. (2017) with modifications to include the post-processing step. Word vectors are initialised with FastText Bojanowski et al. (2017) and fixed during training.

Unlabelled Corpora: Three unlabelled corpora, including BookCorpus Zhu et al. (2015), the UMBC News Corpus Han et al. (2013) and the Amazon Book Review corpus McAuley et al. (2015), are used to train models with invertible decoders. These corpora are referred to as B, U and A in Tables 1 and 2.

Evaluation Tasks:

The evaluation tasks contain 6 unsupervised tasks, in which the similarity of two sentences is determined by the cosine similarity of their vector representations, and 9 supervised tasks, in which a linear model is learnt on the training set in each of the downstream tasks to make predictions for the test set.

The hyperparameters are tuned on the averaged scores on STS14 of the model trained on BookCorpus; this model is therefore marked accordingly in the tables.

References for the compared models and evaluation tasks in Table 2: Conneau et al. (2017); Hill et al. (2016); Kiros et al. (2015); Ba et al. (2016); Gan et al. (2017); Jernite et al. (2017); Nie et al. (2017); Zhao et al. (2015); Logeswaran and Lee (2018); Marelli et al. (2014); Dolan et al. (2004); Li and Roth (2002); Pang and Lee (2005); Hu and Liu (2004); Pang and Lee (2004); Wiebe et al. (2005); Socher et al. (2013).

 

Model Hrs SICK-R SICK-E MRPC TREC MR CR SUBJ MPQA SST

 

Supervised task-dependent training - No transfer learning

AdaSent - - - - 92.4 83.1 86.3 95.5 93.3 -
TF-KLD - - - 80.4/85.9 - - - - - -
Supervised training - Transfer learning
InferSent 24 88.4 86.3 76.2/83.1 88.2 81.1 86.3 92.4 90.2 84.6
Unsupervised training with ordered sentences
FastSent 2 - - 72.2/80.3 76.8 70.8 78.4 88.7 80.6 -
FastSent+AE 2 - - 71.2/79.1 80.4 71.8 76.5 88.8 81.5 -
ST 336 85.8 82.3 73.0/82.0 92.2 76.5 80.1 93.6 87.1 82.0
ST+LN 720 85.8 79.5 - 88.4 79.4 83.1 93.7 89.3 82.9
CNN-LSTM† - 86.2 - 76.5/83.8 92.6 77.8 82.1 93.6 89.4 -
DiscSent‡ 8 - - 75.0/ - 87.2 - - 93.0 - -
DisSent ‡ - 79.1 80.3 - / - 84.6 82.5 80.2 92.4 89.6 82.9
MC-QT 11 86.8 - 76.9/84.0 92.8 80.4 85.2 93.9 89.4 -
B - Bijection+Linear 5 87.3 83.3 74.6/83.0 87.4 80.4 82.2 94.1 89.0 83.7
B - Bijection+Linear 85.1 81.2 74.1/82.0 83.0 78.5 80.6 92.5 88.1 82.0
B - Bijection+Linear 87.7 85.1 75.9/83.4 89.8 80.9 82.7 94.4 89.0 84.2
B - Linear 3.5 86.9 83.4 75.1/83.4 89.4 80.4 82.8 94.0 89.1 84.6
B - Linear 87.3 84.3 73.7/82.5 88.2 79.0 82.0 93.5 88.8 82.8
B - Linear 88.1 85.2 76.5/83.7 90.0 81.3 83.5 94.6 89.5 85.9
U - Linear 9 87.8 85.9 77.5/83.8 92.2 81.3 83.4 94.7 89.5 85.9
A - Linear 9 87.7 84.4 76.0/83.7 90.6 84.0 85.6 95.3 89.7 88.7
Table 2: Results on supervised evaluation tasks. Bold numbers are the best results among unsupervised transfer models with ordered sentences, and underlined numbers are the best ones among all models.

5 Discussion

The inverse functions of our two decoders perform well on all tasks, and the linear decoder performs slightly better than the bijection decoder. The comparison is made based on the models trained on the BookCorpus, and the results are presented in Tables 2 and 3.

Task | B - Linear ($z_{de}$ / $z_{en}$ / ensemble) | B - Bijection+Linear ($z_{de}$ / $z_{en}$ / ensemble)
STS12 55.7 58.8 60.0 55.2 55.3 58.3
STS13 55.7 59.5 60.5 54.6 50.9 58.2
STS14 65.6 69.2 71.1 64.9 67.6 70.4
STS15 71.0 73.9 75.7 71.3 69.1 74.0
STS16 71.8 72.3 75.4 71.2 69.3 74.5
SICK14 69.8 72.2 73.8 69.8 70.8 72.8
Table 3: Comparison between the linear decoder and the bijection decoder on unsupervised tasks. The model with a linear decoder slightly outperforms the one with the bijective function. In both cases, an ensemble of the encoder and the inverse of the decoder performs better than each of them.

An ensemble of $z_{en}$ and $z_{de}$ demonstrates higher performance, as shown in Tables 2 and 3. As $f_{en}$ and $f_{de}^{-1}$ encode the input sentence in distinctive ways, an ensemble of the two representations contains richer information, which leads to better results on the downstream tasks.

The results of our model with an invertible linear decoder trained on all three corpora, together with related work, are presented in Tables 1 and 2. The ensemble of $z_{en}$ and $z_{de}$ either outperforms existing transfer learning methods or provides results comparable with the best ones.

6 Conclusion

Two types of decoders, an orthonormal regularised linear projection and a bijective function, whose inverses can be derived effortlessly, are presented in order to utilise the decoder as another encoder in the testing phase. The experiments and comparisons are conducted on three large unlabelled corpora, and the performance on the downstream tasks shows the high usability and generalisation ability of the decoders in testing. Furthermore, an ensemble of the original encoder and the inverse of the decoder gives results better than either alone. We view this as unifying the generative and discriminative objectives for unsupervised sentence representation learning: the model is trained with a generative objective which, when the decoder is inverted, can be seen as creating a discriminative target, and all learnt components are used as encoders in downstream testing. Future research can extend our framework to tasks that require training with generative objectives.

References

  • Agirre et al. (2015) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In SemEval@NAACL-HLT.
  • Agirre et al. (2014) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. Semeval-2014 task 10: Multilingual semantic textual similarity. In SemEval@COLING.
  • Agirre et al. (2016) Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In SemEval@NAACL-HLT.
  • Agirre et al. (2012) Eneko Agirre, Daniel M. Cer, Mona T. Diab, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In SemEval@NAACL-HLT.
  • Agirre et al. (2013) Eneko Agirre, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *sem 2013 shared task: Semantic textual similarity. In *SEM@NAACL-HLT.
  • Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
  • Ba et al. (2016) Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • Cissé et al. (2017) Moustapha Cissé, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. 2017. Parseval networks: Improving robustness to adversarial examples. In ICML.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.
  • Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. 2014. Nice: Non-linear independent components estimation. CoRR, abs/1410.8516.
  • Dolan et al. (2004) William B. Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In COLING.
  • Firth (1957) J. R. Firth. 1957. A synopsis of linguistic theory.
  • Gan et al. (2017) Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. 2017. Learning generic sentence representations using convolutional neural networks. In EMNLP.
  • Han et al. (2013) Lushan Han, Abhay L Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. Umbc_ebiquity-core: semantic textual similarity systems. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, pages 44–52.
  • Harris (1954) Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
  • Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In HLT-NAACL.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD.
  • Jernite et al. (2017) Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kingma et al. (2016) Diederik P. Kingma, Tim Salimans, and Max Welling. 2016. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934.
  • Kiros et al. (2015) Jamie Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.
  • Le and Mikolov (2014) Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML.
  • Lebret and Collobert (2014) Rémi Lebret and Ronan Collobert. 2014. Word embeddings through hellinger pca. In EACL.
  • Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225.
  • Li and Hovy (2014) Jiwei Li and Eduard H. Hovy. 2014. A model of coherence based on distributed sentence representation. In EMNLP.
  • Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING.
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
  • Marelli et al. (2014) Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A sick cure for the evaluation of compositional distributional semantic models. In LREC.
  • McAuley et al. (2015) Julian J. McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In SIGIR.
  • McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS.
  • Mu et al. (2018) Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
  • Nie et al. (2017) Allen Nie, Erin D. Bennett, and Noah D. Goodman. 2017. Dissent: Sentence representation learning from explicit discourse relations. CoRR, abs/1710.04334.
  • Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL.
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL.
  • Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In ICML.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
  • Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In ICML.
  • Saxe et al. (2013) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL.
  • Tang et al. (2017) Shuai Tang, Hailin Jin, Chen Fang, Zhaowen Wang, and Virginia R. de Sa. 2017. Rethinking skip-thought: A neighborhood based approach. In Rep4NLP@ACL.
  • Tang et al. (2018) Shuai Tang, Hailin Jin, Chen Fang, Zhaowen Wang, and Virginia R. de Sa. 2018. Speeding up context-based sentence representation learning with non-autoregressive convolutional decoding. In Rep4NLP@ACL.
  • Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39:165–210.
  • Wieting and Gimpel (2018) John Wieting and Kevin Gimpel. 2018. Paranmt-50m: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In ACL.
  • Zhao et al. (2015) Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In IJCAI.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. ICCV, pages 19–27.

Appendix A Training Details

The training is done in PyTorch Paszke et al. (2017). The bidirectional RNN encoder, which has 1200 dimensions in each direction, the linear decoder, and the bijective function are initialised with orthonormal matrices Saxe et al. (2013). All models are trained using the Adam optimiser Kingma and Ba (2014) with gradient clipping Pascanu et al. (2013).

The word vectors are initialised with FastText Bojanowski et al. (2017), and the words that are not in the FastText vocabulary are initialised as zero vectors. In addition, we fix the word vectors during training.

Appendix B Training Corpora

Name # of sentences
BookCorpus (B) 74M
UMBC News (U) 134.5M
Amazon Book Review (A) 150.8M
Table 4: Total number of sentences in each corpus.

Appendix C Representation Pooling

For supervised tasks, the global max-, mean- and min-pooling functions McCann et al. (2017) are applied on top of the hidden states produced by $f_{en}$, and the same operation is applied to the outputs of $f_{de}^{-1}$ as well. The two pooled outputs are concatenated to serve as the sentence representation.

For unsupervised tasks, only the global mean-pooling function is applied on top of $f_{en}$ and $f_{de}^{-1}$ individually, and the two outputs are added together Pennington et al. (2014) to serve as the sentence representation.
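A sketch of both pooling schemes, assuming one (seq_len, dim) matrix of states per view; the function names and exact layout are our own illustration.

```python
import torch

def pool_supervised(enc_states, invdec_states):
    """Global max-, mean- and min-pooling over time for each view,
    then concatenation of the two pooled vectors."""
    def pool(h):                                  # h: (seq_len, dim)
        return torch.cat([h.max(dim=0).values,
                          h.mean(dim=0),
                          h.min(dim=0).values])
    return torch.cat([pool(enc_states), pool(invdec_states)])

def pool_unsupervised(enc_states, invdec_states):
    """Mean-pooling per view; the two pooled vectors are added together."""
    return enc_states.mean(dim=0) + invdec_states.mean(dim=0)
```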

Appendix D Evaluation Tasks

d.1 Unsupervised Evaluation

The unsupervised tasks include five tasks from SemEval Semantic Textual Similarity (STS) in 2012-2016 Agirre et al. (2015, 2014, 2016, 2012, 2013) and the SemEval2014 Semantic Relatedness task (SICK-R) Marelli et al. (2014).

The cosine similarity between the vector representations of two sentences determines their textual similarity, and the performance is reported as Pearson's correlation score and Spearman's rank correlation coefficient between the human-annotated labels and the model predictions on each dataset.
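A sketch of this evaluation for one STS dataset (array shapes and names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_scores(emb_a, emb_b, gold):
    """Cosine similarity per sentence pair, then correlation with gold labels.

    emb_a, emb_b: (n_pairs, dim) sentence representations
    gold:         (n_pairs,)     human similarity scores
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = (a * b).sum(axis=1)
    return pearsonr(sims, gold)[0], spearmanr(sims, gold)[0]
```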

d.2 Supervised Evaluation

The supervised evaluation includes semantic relatedness (SICK) Marelli et al. (2014), paraphrase detection (MRPC) Dolan et al. (2004), question-type classification (TREC) Li and Roth (2002), movie review sentiment (MR) Pang and Lee (2005), the Stanford Sentiment Treebank (SST) Socher et al. (2013), customer product reviews (CR) Hu and Liu (2004), subjectivity/objectivity classification (SUBJ) Pang and Lee (2004), and opinion polarity (MPQA) Wiebe et al. (2005).

Among these tasks, MR, CR, SST, SUBJ, MPQA and MRPC are binary classification tasks, while TREC is a multi-class classification task. SICK and MRPC require the same feature engineering method Tai et al. (2015) in order to compose, from the vector representations of two sentences, a single vector that indicates the difference between them.
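This pairwise feature composition typically concatenates the element-wise product and the absolute difference of the two sentence vectors; a sketch under that assumption (the exact concatenation used by the evaluation toolkit may differ):

```python
import torch

def pair_features(u, v):
    """Compose a feature vector from two sentence representations,
    in the spirit of Tai et al. (2015)."""
    return torch.cat([u * v, torch.abs(u - v)], dim=-1)
```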

Appendix E Results

Table 5 presents comparisons with more related work than Table 2 in the main paper.

References for the compared models and evaluation tasks in Table 5: Conneau et al. (2017); Arora et al. (2017); Hill et al. (2016); Kiros et al. (2015); Ba et al. (2016); Gan et al. (2017); Jernite et al. (2017); Nie et al. (2017); Tai et al. (2015); Zhao et al. (2015); Le and Mikolov (2014); Logeswaran and Lee (2018).

 

Model Hrs SICK-R SICK-E MRPC TREC MR CR SUBJ MPQA SST

 

Supervised task-dependent training - No transfer learning
NB-SVM - - - - - 79.4 81.8 93.2 86.3 83.1
AdaSent - - - - 92.4 83.1 86.3 95.5 93.3 -
Tree-LSTM - 86.8 - - - - - - - -
TF-KLD - - - 80.4/85.9 - - - - - -
Supervised training - Transfer learning
InferSent 24 88.4 86.3 76.2/83.1 88.2 81.1 86.3 92.4 90.2 84.6
Unsupervised training with unordered sentences
Unigram-TFIDF - - - 73.6/81.7 85.0 73.7 79.2 90.3 82.4 -
ParagraphVec 4 - - 72.9/81.1 59.4 60.2 66.9 76.3 70.7 -
word2vec BOW 2 80.3 78.7 72.5/81.4 83.6 77.7 79.8 90.9 88.3 79.7
fastText BOW - 80.0 77.9 72.4/81.2 81.8 76.5 78.9 91.6 87.4 78.8
GloVe+WR - 86.0 84.6 - / - - - - - - 82.2
GloVe BOW - 80.0 78.6 72.1/80.9 83.6 78.7 78.5 91.6 87.6 79.8
SDAE 72 - - 73.7/80.7 78.4 74.6 78.0 90.8 86.9 -
Unsupervised training with ordered sentences
FastSent 2 - - 72.2/80.3 76.8 70.8 78.4 88.7 80.6 -
FastSent+AE 2 - - 71.2/79.1 80.4 71.8 76.5 88.8 81.5 -
ST 336 85.8 82.3 73.0/82.0 92.2 76.5 80.1 93.6 87.1 82.0
ST+LN 720 85.8 79.5 - 88.4 79.4 83.1 93.7 89.3 82.9
CNN-LSTM† - 86.2 - 76.5/83.8 92.6 77.8 82.1 93.6 89.4 -
DiscSent‡ 8 - - 75.0/ - 87.2 - - 93.0 - -
DisSent ‡ - 79.1 80.3 - / - 84.6 82.5 80.2 92.4 89.6 82.9
MC-QT 11 86.8 - 76.9/84.0 92.8 80.4 85.2 93.9 89.4 -
B - Bijection+Linear 5 87.3 83.3 74.6/83.0 87.4 80.4 82.2 94.1 89.0 83.7
B - Bijection+Linear 85.1 81.2 74.1/82.0 83.0 78.5 80.6 92.5 88.1 82.0
B - Bijection+Linear 87.7 85.1 75.9/83.4 89.8 80.9 82.7 94.4 89.0 84.2
B - Linear 3.5 86.9 83.4 75.1/83.4 89.4 80.4 82.8 94.0 89.1 84.6
B - Linear 87.3 84.3 73.7/82.5 88.2 79.0 82.0 93.5 88.8 82.8
B - Linear 88.1 85.2 76.5/83.7 90.0 81.3 83.5 94.6 89.5 85.9
U - Linear 9 87.8 85.9 77.5/83.8 92.2 81.3 83.4 94.7 89.5 85.9
A - Linear 9 87.7 84.4 76.0/83.7 90.6 84.0 85.6 95.3 89.7 88.7
Table 5: Comparison on the supervised evaluation tasks. Bold numbers are the best results among unsupervised transfer models, and underlined numbers are the best ones among all models. “†” indicates an ensemble of 2 models. “‡” indicates additional labelled discourse information is required.