Learning Better Universal Representations from Pre-trained Contextualized Language Models

Pre-trained contextualized language models such as BERT have shown great effectiveness in a wide range of downstream natural language processing (NLP) tasks. However, the representations these models offer target each token inside a sequence rather than the sequence as a whole, and the fine-tuning step feeds both sequences of a pair into the model at once, leading to unsatisfying representations of individual sequences. Moreover, because sentence-level sequences form the full training context of these models, performance degrades on lower-level linguistic units (phrases and words). In this work, we present a novel framework built on BERT that is capable of generating universal, fixed-size representations for input sequences of any length, i.e., words, phrases, and sentences, using large-scale natural language inference and paraphrase data with multiple training objectives. Our proposed framework adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and phrase- and word-level representations from a paraphrase dataset, respectively. We evaluate our model across text similarity tasks of different granularities, including the STS tasks, SemEval-2013 Task 5(a) and several commonly used word similarity tasks, where our model substantially outperforms other representation models on sentence-level datasets and achieves significant improvements in word-level and phrase-level representation.

1 Introduction

Representing words, phrases and sentences as low-dimensional dense vectors has always been key to many Natural Language Processing (NLP) tasks. Previous language representation learning methods can be divided into two categories based on the language units they focus on, and are therefore suitable for different situations. High-quality word vectors derived by word embedding models (Mikolov et al., 2013a; Pennington et al., 2014; Joulin et al., 2016) are good at measuring syntactic and semantic word similarities and significantly benefit many natural language processing models. Later proposed sentence encoders (Conneau et al., 2017; Subramanian et al., 2018; Cer et al., 2018) aim to learn generalized fixed-length sentence representations in a supervised or multi-task manner, obtaining substantial results on multiple transfer tasks. Nevertheless, these models focus on either words or sentences, achieving encouraging performance at one level of linguistic unit but less satisfactory results at other levels.

Recently, contextualized representations such as ELMo, OpenAI GPT, BERT, XLNet and ALBERT (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Lan et al., 2019) are expected to capture complex features (syntax and semantics) for sequences of any length. In particular, BERT improved the pre-training and fine-tuning scenario, obtaining new state-of-the-art results on multiple sentence-level tasks at the time. On the basis of BERT, ALBERT introduces three techniques to reduce memory consumption and training time: decomposing the embedding parameters into smaller matrices, sharing parameters across layers, and replacing the next sentence prediction (NSP) task with a sentence-order prediction (SOP) task. In the fine-tuning procedure of both models, the [CLS] token is taken as the final representation of the input sentence pair. Despite their effectiveness, these representations are token-based, and the model requires both sequences to be encoded at the same time, leading to unsatisfying representations of individual sequences. Most importantly, there is a large gap in representing linguistic units of different lengths: lower-level linguistic units such as phrases and words are not handled as well as they are by pre-trained word embeddings.

In this paper, we propose to learn universal representations for linguistic units of different sizes (words, phrases and sentences) through multi-task supervised training on two kinds of datasets: NLI (Bowman et al., 2015; Williams et al., 2018) and the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013). The former is usually used as a sentence-pair classification task to develop semantic representations of sentences. The latter contains a large number of paraphrases, which in our experiments are used for phrase-level and word-level paraphrase identification and pairwise text classification tasks. In order for the model to learn the representation of a single sequence, we use BERT or ALBERT to encode each word, phrase and sentence separately, and then apply mean-pooling to transform the hidden states into a fixed-length vector. Finally, for each sequence pair, the concatenation of the two vectors is fed into a softmax layer for classification. As our experiments reveal, this multi-task learning framework combines the characteristics of the different training objectives with respect to linguistic units of variable lengths.

The model is evaluated on semantic similarity tasks at multiple levels of linguistic units. In addition to standard datasets, we sample pairs of phrases from the Paraphrase Database to construct an additional phrase similarity test set. Results show that our model substantially outperforms sentence representation models including Skip-thought vectors (Kiros et al., 2015), InferSent (Conneau et al., 2017), and GenSen (Subramanian et al., 2018) on seven STS test sets and two phrase-level similarity tasks. Evaluation on word similarity datasets such as SimLex, WS-353, MEN and SCWS also demonstrates that our model is better at encoding words than other sentence representation models, and even surpasses pre-trained word embedding models by 9.59 points in Spearman's correlation on SimLex.

Generally, our model can be used as a universal encoder that produces fixed-size representations for input sequences of any length without additional training for specific tasks. Moreover, it can be further fine-tuned for downstream tasks by simply adding a decoding layer on top of the model with only a few task-specific parameters.

Figure 1: Illustration of the model architecture.

2 Related Work

Representing words as real-valued dense vectors is a core technology of deep learning in NLP. Trained on massive unsupervised corpora, word embedding models (Mikolov et al., 2013a; Pennington et al., 2014; Joulin et al., 2016) map words into a vector space where similar words have similar latent representations. Pre-trained word vectors are well known for their good performance on word similarity tasks, while they are limited in representing phrases and sentences. Different from the above-mentioned static word embedding models, ELMo (Peters et al., 2018) attempts to learn context-dependent word representations through a two-layer bi-directional LSTM network, where each word is allowed to have different representations according to its context. The embedding for each token is the concatenation of hidden states from both directions. Nevertheless, most natural language tasks require representations for higher levels of linguistic units such as phrases and sentences.

Generally, phrase embeddings are more difficult to learn than word embeddings. One approach is to treat each phrase as an individual unit and learn its embedding using the same technique as for words (Mikolov et al., 2013b). However, this method requires a preprocessing step of extracting frequent phrases from the corpus and may suffer from data sparsity, so embeddings learned in this way are not able to truly represent the meaning of the phrases. Since distributed representations of words are a powerful technique that has already been used as a prior for all kinds of NLP tasks, a straightforward and simple way to obtain a phrase representation is to combine the embeddings of all the words it contains. To preserve word order and better capture linguistic information, Yu and Dredze (2015) propose complex composition functions rather than simply averaging word embeddings. Based on an analysis of the impact of training data on phrase embeddings, Zhou et al. (2017) propose to train a pairwise-GRU network on a large-scale paraphrase database.

In recent years, more and more researchers have focused on sentence representations, since they are widely used in applications such as information retrieval, sentiment analysis and question answering. The quality of sentence embeddings is usually evaluated on a wide range of transfer tasks. One simple but powerful baseline for learning sentence embeddings is to represent a sentence as a weighted sum of word vectors. Inspired by the skip-gram algorithm (Mikolov et al., 2013a), the SkipThought model (Kiros et al., 2015), where both the encoder and decoder are based on Recurrent Neural Networks (RNNs), is designed to predict the surrounding sentences of a given passage. Logeswaran and Lee (2018) improve the model structure by replacing the decoder with a classifier that distinguishes context sentences from other sentences. Besides unsupervised training, InferSent (Conneau et al., 2017) is a bi-directional LSTM sentence encoder trained on the Stanford Natural Language Inference (SNLI) dataset. Subramanian et al. (2018) introduce a multi-task framework that combines different training objectives and report considerable improvements on transfer tasks even in low-resource settings. To encode a sentence, researchers have turned their encoder architectures from the RNNs used in SkipThought and InferSent to the Transformer (Vaswani et al., 2017), which relies entirely on attention mechanisms to perform Seq2Seq training. Cer et al. (2018) develop the Universal Sentence Encoder and explore two model variants: the Transformer architecture and the deep averaging network (DAN) (Iyyer et al., 2015) for encoding sentences.

Most recently, Transformer-based language models such as OpenAI GPT, BERT, Transformer-XL, XLNet and ALBERT (Radford et al., 2018, 2019; Devlin et al., 2019; Dai et al., 2019; Yang et al., 2019; Lan et al., 2019) play an increasingly important role in NLP. Unlike feature-based representation methods, BERT follows the fine-tuning approach, where the model is first trained on a large amount of unlabeled data with two training objectives: the masked language model (MLM) and the next sentence prediction (NSP) task. The pre-trained model can then be easily applied to a wide range of transfer tasks through fine-tuning. During the fine-tuning step, two sentences are concatenated and fed into the input layer, and the contextualized embedding of the special token [CLS] added in front of every input is taken as the representation of the sequence pair. Liu et al. (2019) further improve the performance on ten transfer tasks by fine-tuning BERT with multiple training objectives. These deep neural models are theoretically capable of representing sequences of arbitrary lengths, yet experiments show that they perform unsatisfactorily on word and phrase similarity tasks.

Target | Paraphrase | GoogleNgramSim | AGigaSim | Equivalence score | Entailment label
hundreds | thousands | 0. | 0.92851 | 0.000457 | independent
welcomes information that | welcomes the fact that | 0.16445 | 0.64532 | 0.227435 | independent
the results of the work | the outcome of the work | 0.51793 | 0.95426 | 0.442545 | entailment
and the objectives of the | and purpose of the | 0.34082 | 0.66921 | 0.286791 | entailment
different parts of the world | various parts of the world | 0.72741 | 0.97907 | 0.520898 | equivalence
drawn the attention of the | drew the attention of the | 0.72301 | 0.92588 | 0.509006 | equivalence
Table 1: Examples from the PPDB database. Pairs of phrases and words are annotated with similarities computed from the Google n-grams and the Annotated Gigaword corpus, entailment labels and the score for each label.

3 Methodology

The architecture of our universal representation model is shown in Figure 1. The lower layers are initialized with either BERT or ALBERT, which contains 12 Transformer blocks (Vaswani et al., 2017) and 12 self-attention heads with hidden states of dimension 768. The top layers are task-specific decoders, each consisting of a fully-connected layer followed by a softmax classification layer. We add a mean-pooling layer in between to convert a series of hidden states into one fixed-size vector, so that the representation of each input is independent of its length. Different from the default fine-tuning procedure of BERT and ALBERT, where two sentences are concatenated and encoded as a whole sequence, we process and encode each word, phrase and sentence separately. The model is trained on two kinds of datasets, covering different levels of linguistic units, with three tasks. In the following subsections, we present a detailed introduction of the datasets and training objectives.
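For illustration, the following is a minimal sketch of this architecture in PyTorch with the HuggingFace transformers library. It is not the authors' released code: the class name UniversalEncoder, the head names nli/pi/ptc and the two-layer head are our own reading of "a fully-connected layer followed by a softmax classification layer".

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class UniversalEncoder(nn.Module):
    """Shared Transformer encoder with mean-pooling and task-specific decoders."""

    def __init__(self, model_name="bert-base-uncased", hidden=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # 12 layers, 12 heads, 768-dim states
        # One decoder per task: 3-way NLI, binary paraphrase identification (PI),
        # 3-way pairwise text classification (equivalence / entailment / independent).
        self.heads = nn.ModuleDict({
            "nli": nn.Sequential(nn.Linear(3 * hidden, hidden), nn.Linear(hidden, 3)),
            "pi":  nn.Sequential(nn.Linear(3 * hidden, hidden), nn.Linear(hidden, 2)),
            "ptc": nn.Sequential(nn.Linear(3 * hidden, hidden), nn.Linear(hidden, 3)),
        })

    def encode(self, input_ids, attention_mask):
        """Encode one sequence on its own and mean-pool over its (non-padding) tokens."""
        hidden_states = self.encoder(input_ids=input_ids,
                                     attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden_states * mask).sum(1) / mask.sum(1)  # (batch, 768)

    def forward(self, task, ids_a, mask_a, ids_b, mask_b):
        u = self.encode(ids_a, mask_a)   # first sequence of the pair
        v = self.encode(ids_b, mask_b)   # second sequence, same shared parameters
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.heads[task](features)  # logits for the task-specific softmax
```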

3.1 Datasets

SNLI and Multi-Genre NLI

The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and the Multi-Genre NLI corpus (Williams et al., 2018) are sentence-level datasets that are frequently used to improve and evaluate sentence representation models. The former consists of 570k sentence pairs manually annotated with the labels entailment, contradiction, and neutral. The latter is a collection of 433k sentence pairs annotated with textual entailment information. Both datasets are distributed in the same format, except that the latter is drawn from multiple distinct genres. In our experiments, these two corpora are therefore combined and serve as a single dataset for the sentence-level natural language inference task during training. The training and validation sets contain 942k and 29k sentence pairs, respectively.

PPDB

The Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) contains millions of multilingual paraphrases that are automatically extracted from bilingual parallel corpora. Each pair consists of one target and its paraphrase, accompanied by entailment information and similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Relationships between pairs fall into six categories: Equivalence, ForwardEntailment, ReverseEntailment, Independent, Exclusion and OtherRelated. Pairs are marked with scores indicating the probabilities that they belong to each of these categories. The PPDB database is divided into six sizes, from S up to XXXL, and contains three types of paraphrases according to their lengths and rules: lexical, phrasal, and syntactic. To improve the model's ability to represent phrases and words, we use the phrasal dataset of size S, consisting of 1.53 million multiword-to-single/multiword pairs. We apply a preprocessing step to filter and normalize the data for the phrase- and word-level tasks. Specifically, pairs tagged with Exclusion or OtherRelated are removed, and ForwardEntailment and ReverseEntailment are both treated as entailment since our model structure is symmetrical. Finally, we randomly select 354k pairs from each of the three labels (equivalence, entailment and independent), resulting in a total of 1.06 million examples.
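A rough sketch of this preprocessing step is shown below, assuming the '|||'-delimited record format of the PPDB phrasal release; the field positions and helper names are our assumptions, not the authors' code.

```python
import random
from collections import defaultdict

KEEP = {"Equivalence", "ForwardEntailment", "ReverseEntailment", "Independent"}
PER_LABEL = 354_000  # pairs sampled per label, as described above

def normalize(label):
    # The model is symmetric, so both entailment directions collapse into one class.
    if label in ("ForwardEntailment", "ReverseEntailment"):
        return "entailment"
    return "equivalence" if label == "Equivalence" else "independent"

def load_ppdb(path):
    """Filter and relabel PPDB phrasal pairs (assumes '|||'-delimited records,
    with the phrase, its paraphrase and the entailment label among the fields)."""
    buckets = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            phrase, paraphrase, entailment = fields[1], fields[2], fields[-1]
            if entailment not in KEEP:        # drop Exclusion / OtherRelated pairs
                continue
            buckets[normalize(entailment)].append((phrase, paraphrase))
    # Balance the three classes: 354k pairs each, ~1.06M examples in total.
    data = []
    for label, pairs in buckets.items():
        chosen = random.sample(pairs, min(PER_LABEL, len(pairs)))
        data.extend((t, p, label) for t, p in chosen)
    random.shuffle(data)
    return data
```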

3.2 Training Objectives

Sentence-level Natural Language Inference

Natural Language Inference (NLI) is a pairwise classification problem that identifies the relationship between a premise $P = (p_1, \dots, p_{l_P})$ and a hypothesis $H = (h_1, \dots, h_{l_H})$ as entailment, contradiction or neutral, where $l_P$ and $l_H$ are the number of tokens in $P$ and $H$, respectively. Our model is trained on the collection of SNLI and Multi-Genre NLI corpora to perform sentence-level encoding. Different from the default preprocessing procedure of BERT and ALBERT, the premise and hypothesis are tokenized and encoded separately, resulting in two fixed-length vectors $u$ and $v$. Both sentences share the same set of model parameters during the encoding procedure. We then compute $[u; v; |u - v|]$, the concatenation of the premise and hypothesis representations and the absolute value of their element-wise difference, and finally feed it to a fully-connected layer followed by a 3-way softmax classification layer. The probability that a sentence pair is labeled as class $c$ is predicted as:

$$p(c \mid P, H) = \mathrm{softmax}\big(W\,[u; v; |u - v|] + b\big)_c$$

Phrase/word-level Paraphrase Identification

In order to map lower-level linguistic units into the same vector space as sentences, the model is trained to distinguish between paraphrases and non-paraphrases using a large number of phrase and word pairs from the PPDB dataset. Each paraphrase pair consists of a target $T = (t_1, \dots, t_{l_T})$ and its paraphrase $T' = (t'_1, \dots, t'_{l_{T'}})$, where $l_T$ and $l_{T'}$ are the number of tokens in $T$ and $T'$, respectively. Each target or paraphrase is a single word or a phrase composed of up to 6 words. Similar to Zhou et al. (2017), we use the negative sampling strategy (Mikolov et al., 2013b) to reconstruct the dataset: for each target $T$, we randomly sample $k$ sequences from the dataset and annotate the resulting pairs with negative labels, indicating that they are not paraphrases. The encoding and prediction steps are the same as in the previous paragraph. The relationship $y$ between $T$ and $T'$ is predicted by a logistic regression with softmax:

$$p(y \mid T, T') = \mathrm{softmax}\big(W\,[u; v; |u - v|] + b\big)_y$$
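The negative-sampling construction described above could look roughly as follows; build_pi_examples and the exact pairing scheme are illustrative assumptions rather than the authors' implementation.

```python
import random

def build_pi_examples(paraphrase_pairs, k=3, seed=0):
    """Turn (target, paraphrase) pairs into a paraphrase-identification dataset.

    Each positive pair is kept with label 1; for every target, k sequences are
    drawn at random from the rest of the data and paired with it as negatives.
    """
    rng = random.Random(seed)
    pool = [p for _, p in paraphrase_pairs]
    examples = []
    for target, paraphrase in paraphrase_pairs:
        examples.append((target, paraphrase, 1))           # true paraphrase
        for _ in range(k):
            negative = rng.choice(pool)
            if negative != paraphrase:                      # avoid sampling the positive itself
                examples.append((target, negative, 0))      # non-paraphrase
    rng.shuffle(examples)
    return examples
```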

Phrase/word-level Pairwise Text Classification

Apart from the paraphrase identification task, we design a phrase-level and word-level pairwise text classification task to make use of the phrasal entailment information in the PPDB dataset. For each paraphrase pair $(T, T')$, the model is trained to recognize their relationship $r$ as one of three types: equivalence, entailment and independent. This task is more challenging than the previous one, because the model has to capture the degree of similarity between phrases and words, and words are considered dissimilar even if they are closely related. Examples of paraphrase pairs with different entailment labels and their similarity scores are presented in Table 1, where "hundreds" and "thousands" are labeled as independent. A one-layer classifier is used to determine the entailment label for each pair:

$$p(r \mid T, T') = \mathrm{softmax}\big(W\,[u; v; |u - v|] + b\big)_r$$

3.3 Training details

Our model contains one shared encoder with 12 Transformer blocks and 12 self-attention heads, and three task-specific decoders. The dimension of the hidden states is 768. In each iteration we train batches from the three tasks in turn, making sure that the sentence-level task is trained every 2 batches. We use the Adam optimizer, perform warmup over the first 10% of the training data and linearly decay the learning rate afterwards. The batch size is 16 and the dropout rate is 0.1.
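A possible sketch of this multi-task schedule and optimization setup is given below, reusing the UniversalEncoder sketch from Section 3. The learning rate and the exact interleaving of tasks are assumptions; the text only specifies Adam, 10% warmup with linear decay, batch size 16 and the sentence-level task every 2 batches.

```python
import itertools
import torch
from transformers import get_linear_schedule_with_warmup

def train(model, nli_loader, pi_loader, ptc_loader, total_steps, lr=2e-5):
    # lr=2e-5 is an assumption; the paper does not state the value.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),   # warmup over the first 10% of training
        num_training_steps=total_steps)            # then linear decay
    loss_fn = torch.nn.CrossEntropyLoss()

    # One reading of "train batches from three tasks in turn, with the
    # sentence-level task every 2 batches": NLI, PI, NLI, PTC, NLI, PI, ...
    schedule = itertools.cycle([("nli", nli_loader), ("pi", pi_loader),
                                ("nli", nli_loader), ("ptc", ptc_loader)])
    iters = {"nli": iter(nli_loader), "pi": iter(pi_loader), "ptc": iter(ptc_loader)}

    for _, (task, loader) in zip(range(total_steps), schedule):
        try:
            batch = next(iters[task])
        except StopIteration:                      # restart an exhausted loader
            iters[task] = iter(loader)
            batch = next(iters[task])
        # Batch fields (ids_a, mask_a, ids_b, mask_b, label) follow the encoder sketch above.
        logits = model(task, batch["ids_a"], batch["mask_a"],
                       batch["ids_b"], batch["mask_b"])
        loss = loss_fn(logits, batch["label"])
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```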

Method Sentences Phrases Words
STS12 STS13 STS14 STS15 STS16 STS B SICK-R Avg. PPDB SemEval SimLex WS-sim WS-rel MEN SCWS Avg.
pre-trained word embedding models
Avg. GloVe 53.28 50.76 55.63 59.22 57.88 62.96 71.83 58.79 35.93 68.20 40.82 80.15 64.43 80.49 62.90 65.76
Avg. FastText 58.84 58.83 63.42 69.05 68.24 68.26 72.98 65.66 32.15 68.85 50.31 83.38 73.43 84.55 69.40 72.21
sentence representation models
SkipThought 44.27 30.71 39.06 46.73 54.22 73.74 79.21 52.56 37.07 66.41 35.09 61.27 42.15 57.87 58.44 50.96
InferSent-2 62.92 56.08 66.36 74.01 72.89 78.48 83.06 70.54 36.29 78.19 55.88 71.08 44.36 77.40 61.42 62.03
60.85 55.62 62.80 73.46 66.59 78.59 82.58 68.64 48.55 64.31 49.99 56.06 33.96 59.25 59.01 51.65
pre-trained contextualized language models
32.50 23.99 28.50 35.51 51.08 50.40 64.23 60.60 30.19 75.86 7.25 23.06 1.83 19.05 28.05 14.44
47.91 45.28 52.64 60.77 60.94 61.28 71.18 57.14 44.85 76.95 16.67 30.68 14.60 26.80 34.45 24.64
50.06 52.91 54.91 63.37 64.94 64.48 73.50 40.89 42.76 71.18 13.05 2.99 11.22 21.75 23.18 15.85
our methods
ALBERT fine-tune 67.24 72.64 71.44 77.77 73.48 79.73 83.46 75.11 66.71 75.90 59.58 73.85 61.01 69.50 63.86 65.56
BERT fine-tune 69.87 73.68 72.77 78.46 73.61 84.34 84.79 76.79 70.64 79.31 60.75 71.48 55.70 68.47 62.31 63.74
Table 2: Performance of our model and the baseline models on word, phrase and sentence similarity tasks. "SemEval" stands for SemEval-2013 Task 5(a). The subscript indicates the pooling strategy used to obtain fixed-length representations. The best results are in bold. Underlined cells show tasks where our model outperforms all sentence representation models.

4 Evaluation

Our model is evaluated on several text similarity tasks with respect to different levels of linguistic units, i.e., sentences, phrases and words. We use cosine similarity to measure the distance between two sequence representations $u$ and $v$:

$$\mathrm{cos}(u, v) = \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}$$

where $u \cdot v$ is the dot product of $u$ and $v$, and $\lVert \cdot \rVert_2$ is the $\ell_2$-norm of a vector. Spearman's correlation between these cosine similarities and the gold labels is then computed to investigate how much semantic information is captured by our model.
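In code, the evaluation protocol amounts to the following sketch (the encode callable standing in for any of the compared models is an assumption):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # cos(u, v) = (u . v) / (||u||_2 * ||v||_2)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(encode, pairs, gold_scores):
    """encode: callable mapping a string to a fixed-size vector;
    pairs: list of (text_a, text_b); gold_scores: human similarity judgments."""
    predictions = [cosine(encode(a), encode(b)) for a, b in pairs]
    correlation, _ = spearmanr(predictions, gold_scores)
    return correlation
```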

Baselines

We use several currently popular word embedding models and sentence representation models as our baselines. The pre-trained word embeddings used in this work are GloVe (Pennington et al., 2014) and FastText (Joulin et al., 2016), both of dimensionality 300. They represent a phrase or a sentence by averaging the vectors of all the words it contains. SkipThought (Kiros et al., 2015) is an encoder-decoder architecture trained in an unsupervised manner, where both the encoder and decoder are composed of GRU units. Two trained versions of the model are provided, unidirectional and bidirectional; in our experiment, the concatenation of the last hidden states produced by the two models is taken as the representation of the input sequence, with dimensionality 4800. InferSent (Conneau et al., 2017) is a bidirectional LSTM encoder with a max-pooling layer trained on the SNLI dataset. It has two versions, InferSent1 with GloVe vectors and InferSent2 with FastText vectors; the latter is evaluated in our experiment. The output vector is 4096-dimensional. GenSen (Subramanian et al., 2018) is trained through a multi-task learning framework to learn general-purpose sentence representations. The encoder is a bidirectional GRU. Multiple trained models with different training settings are publicly available; we choose the single-layer model trained on skip-thought vectors, neural machine translation, constituency parsing and natural language inference tasks. Since the last hidden states work better than max-pooling on the STS datasets, they are used as sequence representations in our experiment. The dimensionality of the embeddings is 4096. In addition, we examine the pre-trained language models using different pooling strategies: mean-pooling, max-pooling and the [CLS] token. The dimension is 768.
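The three pooling strategies used for the pre-trained language-model baselines can be sketched as follows; the masking details are our assumption of a standard implementation:

```python
import torch

def pool(last_hidden_state, attention_mask, strategy="mean"):
    """Reduce per-token hidden states (batch, seq, 768) to one vector per sequence."""
    mask = attention_mask.unsqueeze(-1).float()
    if strategy == "cls":
        return last_hidden_state[:, 0]                       # embedding of the [CLS] token
    if strategy == "max":
        masked = last_hidden_state.masked_fill(mask == 0, float("-inf"))
        return masked.max(dim=1).values                      # element-wise max over tokens
    return (last_hidden_state * mask).sum(1) / mask.sum(1)   # mean over non-padding tokens
```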

4.1 Sentence-level evaluation

We evaluate our model's ability to encode sentences on SentEval (Conneau and Kiela, 2018), including datasets that require training, the STS benchmark (Cer et al., 2017) and the SICK-Relatedness dataset (Marelli et al., 2014), and datasets that do not, the STS tasks 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016). For our model and the baseline models, we encode each sentence into a fixed-length vector with the corresponding encoder and pooling layer, and then compute the cosine similarity for each sentence pair. For the STS benchmark and the SICK-Relatedness dataset, the model is first trained as a regression task and then evaluated on the test set. The results are displayed in the first eight columns of Table 2.
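Plugging an encoder into SentEval follows the toolkit's batcher/prepare interface, roughly as in the sketch below; the encode placeholder, the data path and the parameter values are illustrative, not the exact settings used in the paper.

```python
import numpy as np
import senteval  # https://github.com/facebookresearch/SentEval

def encode(sentence):
    # Placeholder: replace with the universal encoder's mean-pooled vector (Section 3).
    return np.zeros(768, dtype=np.float32)

def prepare(params, samples):
    return  # no task-specific preparation needed for a fixed encoder

def batcher(params, batch):
    # SentEval passes each sentence as a list of tokens.
    sentences = [" ".join(tokens) for tokens in batch]
    return np.vstack([encode(s) for s in sentences])

params = {"task_path": "PATH_TO_SENTEVAL_DATA", "usepytorch": True, "kfold": 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(["STS12", "STS13", "STS14", "STS15", "STS16",
                   "STSBenchmark", "SICKRelatedness"])
```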

According to the results, our model outperforms all the baseline models on the seven evaluated datasets, achieving significant improvements on sentence similarity tasks. As expected, averaging pre-trained word vectors leads to sentence embeddings inferior to those of well-trained sentence representation models. Among the three RNN-based sentence encoders, InferSent and GenSen obtain higher correlations than SkipThought, though still not as good as our model. The last five rows of the table indicate that fine-tuning pre-trained models substantially improves the quality of sentence embeddings, and that BERT is better than ALBERT of the same size. Benefiting from training on the SNLI and Multi-Genre NLI corpora with the Siamese structure, our model can be used directly as a sentence encoder to extract features for downstream tasks without further training of the parameters. Besides, it can also be efficiently fine-tuned to generate sentence vectors for specific tasks by adding an additional decoding layer such as a softmax classifier.

Sentences Phrases Words
STS12 STS13 STS14 STS15 STS16 STS B SICK-R Avg. PPDB SemEval SimLex WS-sim WS-rel MEN SCWS Avg.
Training Objectives
NLI 70.58 72.24 72.50 79.08 73.22 85.43 85.18 76.89 31.47 77.26 57.94 24.93 29.95 55.05 47.56 30.78
NLI+PI 69.57 73.34 73.27 76.10 72.63 82.71 84.71 76.05 40.44 77.27 59.62 69.38 57.39 67.02 64.39 63.56
NLI+PTC 66.33 73.57 71.35 77.33 73.47 84.16 84.92 75.88 76.94 78.27 41.17 56.10 29.99 48.35 55.36 46.19
NLI+PI+PTC 69.87 73.68 72.77 78.46 73.61 84.34 84.79 76.79 70.64 79.31 60.75 71.48 55.70 68.47 62.31 63.74
Pooling Strategies
CLS 66.79 72.29 70.08 75.36 71.49 82.55 85.04 74.80 68.09 78.13 47.96 63.48 42.84 62.71 57.27 54.85
MAX 68.93 72.99 71.19 76.82 71.56 78.48 83.64 74.80 68.49 77.55 52.42 64.59 45.97 61.97 60.96 57.18
MEAN 69.87 73.68 72.77 78.46 73.61 84.34 84.79 76.79 70.64 79.31 60.75 71.48 55.70 68.47 62.31 63.74
Concatenation Methods
44.75 33.94 35.39 39.16 37.60 77.64 80.98 49.92 31.29 69.21 7.82 12.51 8.88 19.59 22.84 14.33
50.96 59.37 56.48 60.29 55.91 84.17 84.12 64.47 41.53 76.86 50.75 51.57 22.40 48.99 53.86 45.51
62.86 70.17 66.51 69.79 66.12 81.72 84.15 71.62 70.22 75.16 51.38 62.76 42.81 62.90 58.29 55.63
69.87 73.68 72.77 78.46 73.61 84.34 84.79 76.79 70.64 79.31 60.75 71.48 55.70 68.47 62.31 63.74
64.45 71.56 70.52 75.21 69.30 83.46 84.63 74.16 62.31 77.39 53.08 68.95 52.38 64.31 59.30 59.60
63.92 70.02 69.26 77.84 72.34 83.37 84.80 74.52 69.65 77.34 50.31 71.53 46.30 60.95 58.23 57.46
65.13 72.31 70.89 78.12 73.25 84.02 85.04 75.54 62.31 78.00 52.06 64.68 42.91 63.53 58.60 56.36
Negative Sampling
69.08 72.64 72.68 77.31 74.68 84.05 85.19 76.52 60.60 78.18 60.39 54.55 45.72 61.27 58.89 56.16
69.87 73.68 72.77 78.46 73.61 84.34 84.79 76.79 70.64 79.31 60.75 71.48 55.70 68.47 62.31 63.74
68.95 71.70 70.67 75.51 72.52 83.14 84.88 75.34 72.25 77.27 54.58 64.55 48.35 63.49 59.76 58.15
67.34 71.85 71.39 76.32 72.39 83.06 84.53 75.44 69.86 77.94 59.30 68.82 50.88 66.74 60.58 61.26
Table 3: Ablation study on training objectives, pooling strategies, concatenation methods and the value of k in negative sampling. Lower layers are initialized with BERT. "NLI", "PI" and "PTC" stand for the sentence-level natural language inference, phrase/word-level paraphrase identification and phrase/word-level pairwise text classification tasks, respectively. "PPDB" denotes the phrase-level similarity test set extracted from the PPDB database.

4.2 Phrase-level evaluation

Generating high-quality sentence embeddings is not enough; our model is also expected to encode semantic information for lower-level linguistic units. In this experiment, we perform phrase-level evaluation on SemEval-2013 Task 5(a) (Korkontzelos et al., 2013), which asks whether a pair of sequences is semantically similar or not. Each pair contains a word and a multi-word phrase, labeled either positive or negative. Pairs like (megalomania, great madness) are labeled dissimilar although the two sequences are related in certain aspects. Our model and the baseline models are first fine-tuned on the training set of 11,722 examples and then evaluated on 7,814 examples. Parameters of the pre-trained word embeddings, SkipThought, InferSent and GenSen are held fixed during training. Due to the limited availability of phrasal semantic datasets, we also construct a phrase-level semantic similarity test set in the same format as the STS datasets. Specifically, we select pairs from the test set of PPDB and filter out those containing only single words, resulting in 32,202 phrase pairs. In the following experiments, the equivalence score annotated in the original dataset is taken as the relative similarity between the two phrases.

As shown in the middle two columns of Table 2, our model outperforms all the baseline models on SemEval-2013 Task 5(a) and PPDB, suggesting that it learns semantically coherent phrase embeddings. Note the inconsistent behavior of the models across datasets, since the two datasets are distributed differently. Sequences in SemEval-2013 Task 5(a) are either single words or two-word phrases, so GloVe and FastText perform better than SkipThought and GenSen on that dataset. Fine-tuning the pre-trained language models results in higher accuracy than unsupervised models like GloVe, FastText and SkipThought, and is comparable to InferSent and the general-purpose sentence encoder. By first training on a large amount of supervised data and then fine-tuning on the SemEval training data, our model yields the highest accuracy. On PPDB, where most phrases consist of more than two words, RNNs and Transformers are preferable to simply averaging word embeddings. In particular, our model obtains the highest correlation of 70.64.

4.3 Word-level evaluation

Word-level evaluation is conducted on several commonly used word similarity datasets: SimLex (Hill et al., 2015), WS-353 (Finkelstein et al., 2001), MEN (Bruni et al., 2012) and SCWS (Huang et al., 2012). As mentioned in Faruqui et al. (2016), word similarity is often confused with relatedness in some datasets due to the subjectivity of human annotations. To alleviate this problem, WS-353 was later divided into two sub-datasets (Agirre et al., 2009): pairs of words that are similar and pairs of words that are related. For example, "food" and "fruit" are similar while "computer" and "keyboard" are related. The more recently constructed SimLex dataset aims to explicitly quantify similarity rather than relatedness: similar words are annotated with higher scores, while related but dissimilar words receive lower scores. The numbers of word pairs in SimLex, WS-353, MEN and SCWS are 999, 353, 3000 and 2003, respectively. Cosine similarity between word vectors is computed and Spearman's correlation is used for evaluation. Results are depicted in the last five columns of Table 2.

Although pre-trained word embeddings are considered excellent at encoding word semantics, our fine-tuned models obtain the highest correlations on SimLex, 9.11 points higher than the best result reported by the baseline models. Compared with the three sentence representation models, our model yields the best results on 4 out of 5 datasets. The only dataset on which InferSent performs better than our model is MEN. Because InferSent is trained by initializing its embedding layer with pre-trained word vectors, it inherits high-quality word embeddings from FastText to some extent. In contrast, our model is initialized with pre-trained language models and fine-tuned on the NLI and PPDB datasets, leading to significant improvements in representing words, with an average correlation of 63.74 using BERT and 65.56 using ALBERT, compared to only 24.64 using the pre-trained model without fine-tuning.

5 Ablation Study

Our model is trained on the NLI and PPDB datasets with three training objectives covering different levels of linguistic units, leading to strong performance in mapping words, phrases and sentences into the same vector space. In this section, we look into how variants of the training objectives, pooling and concatenation strategies and some hyperparameters affect the model's performance, and identify the overall contribution of each module. The lower layers are initialized with BERT in the following experiments.

5.1 Training Objective

We train our model on different combinations of training objectives to investigate in what respects the model benefits from each of them. According to the first block of Table 3, training on the NLI dataset through the sentence-level task effectively improves the quality of sentence embeddings, but it is not sufficient for phrases and words. The phrase- and word-level tasks on the PPDB dataset address this limitation. In particular, the phrase-level and word-level pairwise text classification task has a positive impact on phrase embeddings. Furthermore, when trained on the paraphrase identification task, the model produces word embeddings that are almost as good as pre-trained word vectors. By combining the characteristics of the different training objectives, our model has an advantage in encoding sentences while achieving considerable improvements on phrase-level and word-level tasks.

5.2 Pooling and Concatenation

We apply different pooling strategies, including mean-pooling, max-pooling and the [CLS] token, and different feature concatenation methods. When investigating the former, we use $[u; v; |u - v|]$ as the input feature for classification; for the latter, we choose mean-pooling as the default strategy. Results are shown in the second and third blocks of Table 3. In accordance with Reimers and Gurevych (2019), we find that averaging BERT token embeddings outperforms the other pooling methods. Besides, the Hadamard product is not helpful in our experiments. Generally, the model is more sensitive to the concatenation method, while the pooling strategy has a minor influence. Taking these results together, mean-pooling is preferred in our experiments, and the concatenation of the two vectors along with their absolute difference is more suitable than other combinations.
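The concatenation variants compared here can be expressed as combinations of the two embeddings, their absolute difference and their Hadamard product; the helper below is a sketch (the exact set of ablated combinations is not fully recoverable from the unlabeled rows of Table 3).

```python
import torch

def build_features(u, v, use_diff=True, use_hadamard=False):
    """Combine two sequence embeddings into one classification feature vector.

    The default ([u; v; |u - v|]) is the combination adopted in the paper;
    the Hadamard product u * v is one of the ablated alternatives.
    """
    parts = [u, v]
    if use_diff:
        parts.append(torch.abs(u - v))
    if use_hadamard:
        parts.append(u * v)
    return torch.cat(parts, dim=-1)
```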

5.3 Negative Sampling

In the paraphrase identification task, we randomly select $k$ negative samples for each target to force the model to distinguish paraphrases from non-paraphrases. Evidence has shown that the value of $k$ in negative sampling has an impact on phrase and word embeddings (Mikolov et al., 2013b). When training word embeddings with negative sampling, setting $k$ in the range of 5-20 is recommended for small training datasets, while for large datasets $k$ can be as small as 2-5. In this ablation experiment, we explore the optimal value of $k$ for our paraphrase identification task. Since the PPDB dataset is extremely large, with more than one million positive pairs, we perform four experiments in which only the value of $k$ is changed and all other model settings are kept the same. The last block of Table 3 shows results for $k$ values of 1, 3, 5 and 7, from which we conclude that $k=3$ is an appropriate choice. Further increasing $k$ has no positive effect on phrase and word embeddings and even decreases performance on sentence tasks.

6 Conclusion

In this work, we propose to learn a universal encoder that maps sequences of different lengths into the same vector space, where similar sequences have similar representations. We train with three objectives on the NLI and PPDB datasets through a Siamese network, and evaluate our model on a wide range of similarity tasks covering multiple levels of linguistic units (sentences, phrases and words). Overall, our model outperforms all the baseline models on sentence- and phrase-level evaluations, and generates high-quality word vectors that are almost as good as pre-trained word embeddings.

References

  • E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa (2009) A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, USA, pp. 19–27. External Links: ISBN 9781932432411 Cited by: §4.3.
  • E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe (2015) SemEval-2015 task 2: semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 252–263. External Links: Document Cited by: §4.1.
  • E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe (2014) Semeval-2014 task 10: multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp. 81–91. Cited by: §4.1.
  • E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe (2016) SemEval-2016 task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, pp. 497–511. External Links: Document Cited by: §4.1.
  • E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo (2013) *SEM 2013 shared task: semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Atlanta, Georgia, USA, pp. 32–43. Cited by: §4.1.
  • E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre (2012) SemEval-2012 task 6: a pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Montréal, Canada, pp. 385–393. Cited by: §4.1.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642. External Links: Document Cited by: §1, §3.1.
  • E. Bruni, G. Boleda, M. Baroni, and N. Tran (2012) Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 136–145. Cited by: §4.3.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14. External Links: Document Cited by: §4.1.
  • D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil (2018) Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 169–174. External Links: Document Cited by: §1, §2.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 670–680. External Links: Document Cited by: §1, §1, §2, §4.
  • A. Conneau and D. Kiela (2018) SentEval: an evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. Cited by: §4.1.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2978–2988. External Links: Document Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document Cited by: Learning Better Universal Representations from Pre-trained Contextualized Language Models, §1, §2.
  • M. Faruqui, Y. Tsvetkov, P. Rastogi, and C. Dyer (2016) Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany, pp. 30–35. External Links: Document Cited by: §4.3.
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001) Placing search in context: the concept revisited. In Proceedings of the 10th international conference on World Wide Web, pp. 406–414. Cited by: §4.3.
  • J. Ganitkevitch, B. Van Durme, and C. Callison-Burch (2013) PPDB: the paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 758–764. Cited by: §1, §3.1.
  • F. Hill, R. Reichart, and A. Korhonen (2015) SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. External Links: Document Cited by: §4.3.
  • E. Huang, R. Socher, C. Manning, and A. Ng (2012) Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 873–882. Cited by: §4.3.
  • M. Iyyer, V. Manjunatha, J. L. Boyd-Graber, and H. Daumé III (2015) Deep unordered composition rivals syntactic methods for text classification. In ACL (1), pp. 1681–1691. External Links: ISBN 978-1-941643-72-3 Cited by: §2.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: §1, §2, §4.
  • R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302. Cited by: §1, §2, §4.
  • I. Korkontzelos, T. Zesch, F. M. Zanzotto, and C. Biemann (2013) SemEval-2013 task 5: evaluating phrasal semantics. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 39–47. Cited by: §4.2.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite bert for self-supervised learning of language representations. External Links: 1909.11942 Cited by: §1, §2.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496. External Links: Document Cited by: §2.
  • L. Logeswaran and H. Lee (2018) An efficient framework for learning sentence representations. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli (2014) A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 216–223. Cited by: §4.1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, §2, §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. External Links: 1310.4546 Cited by: §2, §3.2, §5.3.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Document Cited by: §1, §2, §4.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237. Cited by: §1, §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: §1, §2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §2.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3982–3992. External Links: Document Cited by: §5.2.
  • S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal (2018) Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations, Cited by: §1, §1, §2, §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122. External Links: Document Cited by: §1, §3.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1, §2.
  • M. Yu and M. Dredze (2015) Learning composition models for phrase embeddings. Transactions of the Association for Computational Linguistics 3, pp. 227–242. Cited by: §2.
  • Z. Zhou, L. Huang, and H. Ji (2017) Learning phrase embeddings from paraphrases with GRUs. In Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora, Taipei, Taiwan, pp. 16–23. Cited by: §2, §3.2.