Unsupervised sentence encoders are often trained on language modeling based tasks where the encoded sentence representations are used to reconstruct the input sentence Hill et al. (2016) or generate neighboring sentences Kiros et al. (2015); Hill et al. (2016). The trained encoders produce sentence representations that achieve the best performance on many sentence-level prediction tasks Hill et al. (2016).
However, training encoders using such language modeling based tasks is difficult. Language model prediction over large vocabularies across large contexts often means having large models (at least in the output layers), requiring large training data and long training times.
Instead, we introduce an unsupervised discriminative training task, fake sentence detection. The main idea is to generate fake sentences by corrupting an original sentence. We use two methods to generate fake sentences: word shuffling, where we swap the positions of two words at random, and word dropping, where we drop a word at random from the original sentence. The resulting fake sentences are mostly similar to the original sentences: a fake sentence differs from its source in at most two word positions. Given a source corpus of unlabeled English sentences, we build a new collection of sentences by creating multiple fake sentences for every sentence in the source corpus. The training task is then to take any given sentence from this new collection as input and predict whether it is a real or fake sentence.
This training task formulation has two key advantages: (i) This binary classification task can be modeled with fewer parameters in the output layer and can be trained more efficiently compared to the language modeling training tasks where the output layer has many parameters depending on the vocabulary size. (ii) The task forces the encoder to track both syntax and semantics. Swapping words, for instance, can not only break syntax, but can also lead to a sentence that is semantically incoherent or less plausible (e.g., “John reached Chicago.” versus “Chicago reached John”).
We train a bidirectional long short-term memory network (BiLSTM) encoder that produces a representation of the input sentence, which is fed to a three-layer feed-forward network for prediction. We then evaluate this trained encoder without any further tuning on multiple sentence-level tasks and test for syntactic and semantic properties, which demonstrate the benefits of fake sentence training.
In summary, this paper makes the following contributions: 1) We introduce fake sentence detection as an unsupervised training task for learning sentence encoders that can distinguish between small changes in mostly similar sentences. 2) We present an empirical evaluation on multiple sentence-level tasks showing that representations trained on the fake sentence tasks outperform a strong baseline model trained on language modeling tasks, even when training on small amounts of data (1M vs. 64M sentences), reducing training time from weeks to within 20 hours.
2 Related Work
Previous sentence encoding approaches can be broadly classified as supervised Conneau et al. (2017); Cer et al. (2018); Marcheggiani and Titov (2017); Wieting et al. (2015), unsupervised Kiros et al. (2015); Hill et al. (2016) or semi-supervised approaches 4582; Peters et al. (2018); Dai and Le (2015); Socher et al. (2011). The supervised approaches train the encoders on tasks such as NLI and use transfer learning to adapt the learned encoders to different downstream tasks. The unsupervised approaches extend the skip-gram model Mikolov et al. (2013) to the sentence level, and use the sentence embedding to predict the adjacent sentences. Skipthought Kiros et al. (2015) uses a BiLSTM encoder to obtain a fixed length embedding for a sentence, and uses a BiLSTM decoder to predict adjacent sentences. Training the Skipthought model is expensive: one epoch of training on the Toronto BookCorpus Zhu et al. (2015) dataset takes more than two weeks Hill et al. (2016) on a single GPU. FastSent Hill et al. (2016) uses embeddings of a sentence to predict words from the adjacent sentences. A sentence is represented by simply summing up the word representations of all the words in the sentence. FastSent requires less training time than Skipthought, but has worse performance. Semi-supervised approaches train sentence encoders on large unlabeled datasets, and do a task-specific adaptation using labeled data.
In this work, we propose an unsupervised sentence encoder that takes around 20 hours to train on a single GPU, and outperforms Skipthought and FastSent encoders on multiple downstream tasks. Unlike the previous unsupervised approaches, we use the binary task of real versus fake sentence classification to train a BiLSTM based sentence encoder.
3 Training Tasks for Encoders
We propose a discriminative task for training sentence encoders. The key bottleneck in training sentence encoders is the need for large amounts of labeled data. Prior work uses language modeling as a training task, leveraging unlabeled text data. Encoders are trained to produce sentence representations which are effective at either generating neighboring sentences (e.g., Skipthought Kiros et al. (2015)) or at least effective at predicting the words in the neighboring sentences Hill et al. (2016). The challenge becomes one of balance between model coverage (i.e., the number of output words it can predict) and model complexity (i.e., the number of parameters needed for prediction).
Rather than address the language modeling challenges, we propose a simpler training task that requires making a single prediction over an input sentence. In particular, we propose to learn a sentence encoder by training a sequential model to solve the binary classification task of detecting whether a given input sentence is fake or real. This real-fake sentence classification task would perhaps be trivial if the fake sentences look very different from the real sentences. We propose two simple methods to generate noisy sentences which look mostly similar to real sentences. We describe the noisy sentence generation strategies in Section 3.1. Thus, we create a labeled dataset of real and fake sentences, and train a sequential model to distinguish between real and fake sentences, which results in a model whose classification layer has far fewer parameters than previous language model based encoders. Our model architecture is described in Section 3.2.
3.1 Fake Sentence Generation
For a sentence $X = (w_1, \ldots, w_n)$ comprising $n$ words, we consider two strategies to generate a noisy version $\hat{X}$ of the sentence: 1) WordShuffle: randomly sample two indices $i$ and $j$ corresponding to words $w_i$ and $w_j$ in $X$, and swap the words to obtain the noisy sentence $\hat{X}$. The noisy sentence $\hat{X}$ has the same length as the original sentence $X$. 2) WordDrop: randomly pick one index $i$ corresponding to word $w_i$ and drop the word from the sentence to obtain $\hat{X}$. Note that there can be many variants of this strategy, but here we experiment with this basic choice.
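The two corruption strategies can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names, whitespace tokenization, the number of fakes per sentence, and the label convention (1 for real, 0 for fake) are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple

Corruptor = Callable[[List[str], random.Random], List[str]]


def word_shuffle(words: List[str], rng: random.Random) -> List[str]:
    """WordShuffle: swap the words at two distinct, randomly sampled positions."""
    if len(words) < 2:
        return list(words)  # nothing to swap
    i, j = rng.sample(range(len(words)), 2)
    noisy = list(words)
    noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy


def word_drop(words: List[str], rng: random.Random) -> List[str]:
    """WordDrop: remove the word at one randomly sampled position."""
    if not words:
        return []
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


def build_dataset(sentences: List[str], corrupt: Corruptor,
                  fakes_per_sentence: int = 2,
                  seed: int = 0) -> List[Tuple[List[str], int]]:
    """Pair every real sentence (label 1) with generated fakes (label 0)."""
    rng = random.Random(seed)
    data = []
    for sentence in sentences:
        words = sentence.split()  # assumption: whitespace tokenization
        data.append((words, 1))
        for _ in range(fakes_per_sentence):
            fake = corrupt(words, rng)
            # Swapping two identical words reproduces the original; skip it.
            if fake != words:
                data.append((fake, 0))
    return data
```

Calling `build_dataset(["John reached Chicago ."], word_shuffle)` yields the real sentence followed by shuffled variants such as "Chicago reached John ." (the exact variants depend on the seed).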
3.2 Real Versus Fake Sentence Classification
The accompanying figure shows the proposed architecture of our fake sentence classifier, with an encoder and a multi-layer perceptron (MLP) with 2 hidden layers. The encoder consists of a bidirectional LSTM followed by a max pooling layer. At each time step $t$ we concatenate the forward and backward hidden states to get $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. We apply max-pooling to these concatenated hidden states to get a fixed-length representation, which we then use as input to an MLP for classifying into real/fake classes.
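The pooling step can be illustrated independently of any deep learning library. The sketch below assumes the forward and backward hidden states have already been computed by the BiLSTM and are given as plain lists; the helper names and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def concat_states(forward: Sequence[Sequence[float]],
                  backward: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate forward and backward hidden states at each time step."""
    if len(forward) != len(backward):
        raise ValueError("both directions must cover the same time steps")
    return [list(f) + list(b) for f, b in zip(forward, backward)]


def max_pool_over_time(states: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum across time steps gives a fixed-length vector."""
    if not states:
        raise ValueError("need at least one time step")
    return [max(column) for column in zip(*states)]


# Toy example: 3 time steps, 2-d hidden state per direction.
forward = [[0.1, -0.4], [0.7, 0.2], [0.3, 0.9]]
backward = [[0.5, 0.0], [-0.2, 0.6], [0.4, -0.1]]
sentence_vector = max_pool_over_time(concat_states(forward, backward))
print(sentence_vector)  # [0.7, 0.9, 0.5, 0.6]
```

The output dimension depends only on the hidden size (twice the per-direction size), not on the sentence length, which is what lets a fixed-size MLP consume sentences of any length.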
4 Evaluation Setup
Downstream Tasks: We compare the sentence encoders trained on a large collection (BookCorpus Zhu et al. (2015)) by testing them on multiple sentence-level classification tasks (MR, CR, SUBJ, MPQA, TREC, SST) and one NLI task defined over sentence pairs (SICK). We also evaluate the sentence representations for image and caption retrieval tasks on the COCO dataset Lin et al. (2014). We use the same evaluation protocol and dataset split as Karpathy and Fei-Fei (2015); Conneau et al. (2017). Table 1 lists the classification tasks and the datasets. We also compare the sentence representations for how well they capture important syntactic and semantic properties using probing classification tasks Conneau et al. (2018). For all downstream and probing tasks, we use the encoders to obtain representations for all the sentences, and train logistic regression classifiers on the training split. We tune the $L_2$-norm regularizer using the validation split, and report the results on the test split.
Training Corpus: The FastSent and Skipthought encoders are trained on the full Toronto BookCorpus of 64M sentences Zhu et al. (2015). Our models, however, train on a much smaller subset of only 1M sentences.
Table 2: Results on downstream tasks. Bold face indicates best result and underlined results show when fake sentence training is better than Skipthought (full). COCO-Cap and COCO-Img are caption and image retrieval tasks on COCO. We report Recall@5 for the COCO retrieval tasks.
Table 3: Probing task accuracies. Tasks: SentLen: predict sentence length, WC: is word in sentence, TreeDepth: depth of syntactic tree, TopConst: predict top-level constituent, BShift: is a bigram flipped in the sentence, Tense: predict tense of word, Subj(Obj)Num: singular or plural subject (object), SOMO: semantic odd man out, CoordInv: is coordination inverted.
Sentence Encoder Implementation: Our sentence encoder architecture is the same as the BiLSTM-max model Conneau et al. (2017). We represent words using 300-d pretrained GloVe embeddings Pennington et al. (2014). We use a single-layer BiLSTM model with 2048-d hidden states. The MLP classifier we use for fake sentence detection has two hidden layers with 1024 and 512 neurons. We train separate models for word drop and word shuffle. The models are trained for 15 epochs with a batch size of 64 using SGD, at which point training converges with a validation set accuracy of 87.2 for word shuffle. The entire training completes in less than 20 hours on a single GPU machine.
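To make the output-layer argument concrete, the following back-of-the-envelope sketch counts the parameters of the classification head from the sizes stated above and compares them with a softmax over a vocabulary. Two values are assumptions for illustration only: the 20,000-word vocabulary, and placing the softmax directly on top of the 4096-d sentence representation (a real language-model decoder would be structured differently). The two-unit output layer is likewise an assumption.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


ENCODER_DIM = 2 * 2048   # forward + backward 2048-d states after max pooling
HIDDEN = (1024, 512)     # MLP hidden layer sizes stated in the text
NUM_CLASSES = 2          # assumption: one output unit per class (real, fake)
VOCAB_SIZE = 20_000      # assumption: illustrative vocabulary size

sizes = (ENCODER_DIM, *HIDDEN, NUM_CLASSES)
mlp_params = sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))
output_layer_params = dense_params(HIDDEN[-1], NUM_CLASSES)
softmax_params = dense_params(ENCODER_DIM, VOCAB_SIZE)

print(f"whole MLP head:           {mlp_params:,}")           # 4,721,154
print(f"binary output layer only: {output_layer_params:,}")  # 1,026
print(f"vocabulary softmax layer: {softmax_params:,}")       # 81,940,000
```

Under these assumptions the entire classification head is more than an order of magnitude smaller than a single vocabulary-sized output layer, and its size does not grow with the vocabulary.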
Classification and NLI:
Results are shown in Table 2. Both fake sentence training tasks yield better performance on five out of the seven language tasks when compared to Skipthought (full), i.e., even when it is trained on the full BookCorpus. Word drop and word shuffle performances are mostly comparable. The Skipthought (1M) row shows that training on a sentence-level language modeling task can fare substantially worse when trained on a smaller subset of data. FastSent, while easier and faster to train, is better than Skipthought (1M) but worse than the full Skipthought model.
On both caption and image retrieval tasks (last 2 columns of Table 2), fake sentence training with either word dropping or word shuffle is better than the published Skipthought results.
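For reference, the Recall@5 metric reported for the retrieval tasks can be sketched as follows. The sketch assumes the common convention in which a query counts as a hit when at least one relevant item appears among its top K ranked candidates; the function name and the toy identifiers are hypothetical, and the ranking itself is taken as given.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(rankings: Sequence[Sequence[Hashable]],
                relevant: Sequence[Set[Hashable]],
                k: int = 5) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    if len(rankings) != len(relevant):
        raise ValueError("need one relevant set per query")
    if not rankings:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, gold in zip(rankings, relevant)
        if any(item in gold for item in ranked[:k])
    )
    return hits / len(rankings)


# Two image queries, each with a ranked list of candidate caption ids.
rankings = [["c1", "c7", "c3"], ["c9", "c2", "c4"]]
relevant = [{"c3"}, {"c5"}]
print(recall_at_k(rankings, relevant, k=5))  # 0.5
```

The same function covers both directions of retrieval: for caption retrieval the queries are images and the candidates are captions, and for image retrieval the roles are reversed.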
Table 3 compares sentence encoders using the recently proposed probing tasks Conneau et al. (2018). The goal of each task is to use the input sentence encoding to predict a particular syntactic or semantic property of the original sentence it encodes (e.g., predict if the sentence contains a specific word). Encodings from fake sentence training score higher in six out of the ten tasks. WordShuffle encodings are significantly better than Skipthought in some semantic properties: tracking word content (WC), bigram shuffles (BShift), semantic odd man out (SOMO). Skipthought and WordShuffle are comparable on syntactic properties: agreement (SubjNum, ObjNum, Tense, and CoordInv). The only exception is TreeDepth, where WordShuffle is substantially better. Table 4 shows examples of the BShift task and cases where the word shuffle and Skipthought models fail. In general we find that word shuffle works better when shifted bigrams involve prepositions, articles, or conjunctions.
|It shone the in light .||✓|
|I seized the and sword leapt to the window .||✓|
|Once again Amadeus held out arm his .||✓||✓|
|When we get inside , I know that I have to leave and Marceline find .||✓|
This work introduced an unsupervised training task, fake sentence detection, where the sentence encoders are trained to produce representations which are effective at detecting whether a given sentence is an original or a fake. This leads to better performance on downstream tasks and to representations that capture semantic and syntactic properties, while also reducing the amount of training needed. More generally, the results suggest that tasks which test for different syntactic and semantic properties in altered sentences can be useful for learning effective representations.
- Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.
- Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
- Conneau et al. (2018) Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070.
- Dai and Le (2015) Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.
- Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483.
- Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In
- Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
- Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, et al. 2018. Deep contextualized word representations.
- Socher et al. (2011) Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the conference on empirical methods in natural language processing, pages 151–161. Association for Computational Linguistics.
- Wieting et al. (2015) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.