Code for Paper: SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models
Sentence embedding is an important research topic in natural language processing (NLP) since it can transfer knowledge to downstream tasks. Meanwhile, a contextualized word representation, called BERT, achieves state-of-the-art performance in quite a few NLP tasks. Yet, generating a high-quality sentence representation from BERT-based word models remains an open problem. Previous studies showed that different layers of BERT capture different linguistic properties, which allows us to fuse information across layers to find a better sentence representation. In this work, we study the layer-wise pattern of the word representations of deep contextualized models. Then, we propose a new sentence embedding method, called SBERT-WK, by dissecting BERT-based word models through geometric analysis of the space spanned by the word representations. No further training is required in SBERT-WK. We evaluate SBERT-WK on semantic textual similarity and downstream supervised tasks. Furthermore, ten sentence-level probing tasks are presented for detailed linguistic analysis. Experiments show that SBERT-WK achieves state-of-the-art performance. Our code is publicly available.
Static word embedding is a popular learning technique that transfers prior knowledge from a large unlabeled corpus [1, 2, 3]. Most recent sentence embedding methods are rooted in the observation that static word representations can encode rich syntactic and semantic information. It is desirable to extend word-level embedding to the sentence level, which covers a longer piece of text. In the last several years, we have witnessed a breakthrough in replacing "static" word embeddings with "contextualized" word representations, e.g., [4, 5, 6, 7]. A natural question to ask is how to exploit contextualized word embeddings in the context of sentence embedding. Here, we examine the problem of learning a universal representation of sentences. A contextualized word representation, called BERT, achieves state-of-the-art performance in many natural language processing (NLP) tasks. We aim at developing a sentence embedding solution from BERT-based models in this work.
As reported in prior studies, different layers of BERT learn different levels of information and linguistic properties. While intermediate layers encode the most transferable features, representations from higher layers are more expressive of high-level semantic information. Thus, information fusion across layers has potential for providing a stronger representation. Furthermore, by conducting experiments on the patterns of isolated word representations across layers in deep models, we observe the following property: words carrying richer information in a sentence have higher variation in their representations, while token representations change gradually across layers. This finding helps define "salient" word representations and informative words in computing a universal sentence embedding.
Although BERT-based contextualized word embedding methods perform well in NLP tasks, they have their own limitations. For example, due to the large model size, it is time-consuming to perform sentence-pair regression tasks such as clustering and semantic search. The most effective way to address this problem is through an improved sentence embedding method, which transforms a sentence into a vector that encodes its semantic meaning. Currently, a common sentence embedding approach for BERT-based models is to average the representations obtained from the last layer or to use the [CLS] token representation intended for sentence-level prediction. Yet, both are sub-optimal, as shown in the experimental section of this paper. To the best of our knowledge, there is only one prior work on sentence embedding using pre-trained BERT, called Sentence-BERT (SBERT). It relies on further training with high-quality labeled sentence pairs. Evidently, how to obtain sentence embeddings from deep contextualized models is still an open problem.
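The two common baselines mentioned above can be sketched concretely. Below is a minimal NumPy illustration of mean-pooling over the last layer versus taking the [CLS] position; the array shapes follow the usual (batch, tokens, dimension) convention, and the function names are ours, not from any particular library:

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real (non-padding) positions."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                    # (B, d)
    counts = mask.sum(axis=1)                                          # (B, 1)
    return summed / np.clip(counts, 1e-9, None)

def cls_pool(last_hidden_state: np.ndarray) -> np.ndarray:
    """Take the vector at position 0, i.e. the [CLS] token."""
    return last_hidden_state[:, 0, :]
```

Either function maps the model's token-level output to one fixed-size sentence vector; as discussed later, both are sub-optimal for similarity tasks.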
While word embeddings are learned using a loss function defined on word pairs, sentence embedding demands a loss function defined at the sentence level. Following a path similar to word embedding, unsupervised sentence encoders, e.g., SkipThought and FastSent, build self-supervision from a large unlabeled corpus. Yet, InferSent shows that training on high-quality labeled data, e.g., the Stanford Natural Language Inference (SNLI) dataset, can consistently outperform unsupervised training objectives. Recently, leveraging training results from multiple tasks has become a new trend in sentence embedding since it provides better generalization performance. USE incorporates both supervised and unsupervised training objectives on the Transformer architecture. Other methods are trained in a multi-tasking manner so as to combine the inductive biases of diverse training objectives. However, multi-task learning for sentence embedding is still under development, and it faces difficulties in selecting supervised tasks and handling interactions between them. Furthermore, supervised training objectives demand high-quality labeled data, which are usually expensive.
Being drastically different from the above-mentioned research, we investigate sentence embedding by studying the geometric structure of deep contextualized models and propose a new method by dissecting BERT-based word models. It is called the SBERT-WK method. As compared with previous sentence embedding models that are trained on sentence-level objectives, deep contextualized models are trained on a large unlabeled corpus with both word- and sentence-level objectives. SBERT-WK inherits the strength of deep contextualized models. It is compatible with most deep contextualized models such as BERT  and SBERT .
This work has the following three main contributions.
We study the evolution of isolated word representation patterns across layers in BERT-based models. These patterns are shown to be highly correlated with a word's content. This provides useful insights into deep contextualized word models.
We propose a new sentence embedding method, called SBERT-WK, through geometric analysis of the space learned by deep contextualized models.
We evaluate the SBERT-WK method against eight downstream tasks and seven semantic textual similarity tasks, and show that it achieves state-of-the-art performance. Furthermore, we use sentence-level probing tasks to shed light on the linguistic properties learned by SBERT-WK.
The rest of the paper is organized as follows. Related work is reviewed in Sec. II. The evolution of word representation patterns in deep contextualized models is studied in Sec. III. The proposed SBERT-WK method is presented in Sec. IV. The SBERT-WK method is evaluated with respect to various tasks in Sec. V. Finally, concluding remarks and future work directions are given in Sec. VI.
Traditional word embedding methods provide a static representation for each word in a vocabulary. Although static representations are widely adopted in NLP, they have several limitations in modeling context information. First, they cannot deal with polysemy. Second, they cannot adjust the meaning of a word based on its context. To address these shortcomings of static word embedding methods, there is a new trend to go from shallow to deep contextualized representations. For example, ELMo, GPT, GPT-2 and BERT are pre-trained deep neural language models that can be fine-tuned on specific tasks. These new word embedding methods achieve impressive performance on a wide range of NLP tasks. In particular, BERT-based models dominate the leaderboards of language understanding tasks such as SQuAD 2.0 and the GLUE benchmark.
ELMo is one of the earlier works applying a pre-trained language model to downstream tasks. It employs a two-layer bi-directional LSTM and fuses features from all LSTM outputs using task-specific weights. OpenAI GPT incorporates a fine-tuning process when it is applied to downstream tasks: task-specific parameters are introduced and fine-tuned together with all pre-trained parameters. BERT employs the Transformer architecture, which is composed of multiple multi-head attention layers and can be trained more efficiently than an LSTM. It is trained on a large unlabeled corpus with several objectives to learn both word- and sentence-level information, including masked language modeling and next sentence prediction. A couple of variants have been proposed based on BERT. RoBERTa attempts to improve BERT by providing a better recipe for model training. ALBERT aims to compress the model size of BERT by introducing two parameter-reduction techniques while achieving better performance. XLNET adopts a generalized auto-regressive pre-training method that combines the merits of auto-regressive and auto-encoding language models.
Because of the superior performance of BERT-based models, it is important to have a better understanding of BERT-based models and the transformer architecture. Efforts have been made along this direction recently as reviewed below. Liu et al.  and Petroni et al.  used word-level probing tasks to investigate the linguistic properties learned by the contextualized models experimentally. Kovaleva et al.  and Michel et al.  attempted to understand the self-attention scheme in BERT-based models. Hao et al.  provided insights into BERT by visualizing and analyzing the loss landscapes in the fine-tuning process. Ethayarajh  explained how the deep contextualized model learns the context representation of words. Despite the above-mentioned efforts, the evolving pattern of a word representation across layers in BERT-based models has not been studied before. In this work, we first examine the pattern evolution of a token representation across layers without taking its context into account. With the context-independent analysis, we observe that the evolving patterns are highly related to word properties. This observation in turn inspires the proposal of a new sentence embedding method – SBERT-WK.
By sentence embedding, we aim at extracting a numerical representation of a sentence that encapsulates its meaning. The linguistic features learned by a sentence embedding method can serve as external information resources for downstream tasks. Sentence embedding methods fall into two categories: non-parameterized and parameterized models. Non-parameterized methods usually rely on high-quality pre-trained word embeddings. The simplest example is to average the word embeddings as the representation of a sentence. Following this line of thought, several weighted averaging methods were proposed, including tf-idf, SIF, uSIF and GEM. SIF uses a random walk to model the sentence generation process and derives word weights using maximum likelihood estimation (MLE). uSIF extends SIF by introducing an angular-distance-based random walk model; no hyper-parameter tuning is needed in uSIF. By exploiting geometric analysis of the space spanned by word embeddings, GEM determines word weights with several hand-crafted measurements. Instead of weighted averaging, the p-mean method concatenates the power means of word embeddings and fuses different word embedding models so as to shorten the performance gap between non-parameterized and parameterized models.
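As an illustration of the weighted-averaging family, the SIF recipe can be sketched in a few lines of NumPy. The weight a/(a + p(w)) and the removal of the first principal direction follow the SIF paper; the function name and the default a = 1e-3 here are illustrative choices, not a definitive implementation:

```python
import numpy as np

def sif_embed(sent_word_vecs, word_probs, a=1e-3):
    """SIF-style sentence embeddings.

    sent_word_vecs: list of (n_i, d) arrays, one per sentence.
    word_probs:     list of (n_i,) unigram probabilities for those words.
    """
    # Weighted average with weight a / (a + p(w)) per word.
    emb = np.stack([
        (a / (a + p[:, None]) * V).mean(axis=0)
        for V, p in zip(sent_word_vecs, word_probs)
    ])                                                  # (num_sents, d)
    # Remove the projection onto the first principal direction
    # (the "common component" shared by all sentences).
    u = np.linalg.svd(emb, full_matrices=False)[2][0]   # first right singular vector
    return emb - np.outer(emb @ u, u)
```

After common-component removal, the sentence vectors carry no energy along the dominant shared direction, which is what makes SIF a strong baseline on similarity tasks.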
Parameterized models are more complex, and they usually perform better than non-parameterized models. The skip-thought model extends the unsupervised training of word2vec from the word level to the sentence level. It adopts an encoder-decoder architecture to learn the sentence encoder. InferSent employs a bi-directional LSTM with supervised training. It trains the model to predict the entailment or contradiction of sentence pairs on the Stanford Natural Language Inference (SNLI) dataset and achieves better results than methods with unsupervised learning. The USE (Universal Sentence Encoder) method extends the InferSent model by employing the Transformer architecture with both unsupervised and supervised training objectives. Later studies observed that training with multiple objectives in sentence embedding can provide better generalizability.
The SBERT method is the only parameterized sentence embedding model using BERT as the backbone. SBERT shares high similarity with InferSent: it uses a Siamese network on top of the BERT model and fine-tunes it on high-quality sentence inference data (e.g., the SNLI dataset) to learn more sentence-level information. However, unlike supervised tasks, universal sentence embedding methods in general do not have a clear objective function to optimize. Instead of training on more sophisticated multi-tasking objectives, we combine the advantages of both parameterized and non-parameterized methods: SBERT-WK is computed by subspace analysis of the manifold learned by parameterized BERT-based models.
In signal processing and data science, subspace learning and analysis offer powerful tools for multidimensional data processing. Correlated high-dimensional data can be analyzed using latent variable representation methods such as Principal Component Analysis (PCA), Singular Spectrum Analysis (SSA), Independent Component Analysis (ICA) and Canonical Correlation Analysis (CCA). Subspace analysis has a solid mathematical foundation and has been used to explain and understand the internal states of deep neural networks.
The goal of word or sentence embedding is to map words or sentences onto a high-dimensional space. Thus, subspace analysis is widely adopted in this field, especially for word embedding. Before learning-based word embedding, factorization-based word embedding was the mainstream methodology, analyzing the co-occurrence statistics of words. To find word representations, Latent Semantic Analysis (LSA) factorizes the co-occurrence matrix using singular value decomposition. Levy and Goldberg pointed out the connection between the word2vec model and factorization-based methods. Recently, subspace analysis has been adopted for interpretable word embedding because of its mathematical transparency. Subspace analysis is also widely used in the post-processing and evaluation of word embedding models.
Because of the success of subspace analysis in word embedding, it is natural to incorporate it in sentence embedding since a sentence is composed of a sequence of words. For example, SCDV determines the sentence/document vector by splitting words into clusters and analyzing them accordingly. GEM models the sentence generation process as a Gram-Schmidt procedure and gradually expands the subspace formed by word vectors. Both DCT and EigenSent map a sentence matrix into the spectral space and model the high-order dynamics of a sentence from a signal processing perspective.
Although subspace analysis has already been applied to sentence embedding, all of the above-mentioned work was built upon static word embedding methods. To the best of our knowledge, our work is the first to exploit subspace analysis to find generic sentence embeddings based on deep contextualized word models. We will show in this work that SBERT-WK can consistently outperform state-of-the-art methods with low computational overhead and good interpretability, which is attributed to the high transparency and efficiency of subspace analysis and the power of deep contextualized word embeddings.
Evolving word representation patterns across layers measured by cosine similarity, where (a-d) show the similarity across layers and (e-h) show the similarity over different hops. Four contextualized word representation models (BERT, SBERT, RoBERTa and XLNET) are tested.
Although studies have been conducted on understanding the word representations learned by deep contextualized models, none of them examines how a word representation evolves across layers. To observe such evolving patterns, we design experiments in this section considering the following four BERT-based models.
BERT . It employs the bi-directional training of the transformer architecture and applies it to language modeling. Unsupervised objectives, including the masked language model and the next sentence prediction, are incorporated.
SBERT . It integrates the Siamese network with a pre-trained BERT model. The supervised training objective is added to learn high quality sentence embedding.
RoBERTa . It adapts the training process of BERT to more general environments such as longer sequences, bigger batches, more data and mask selection schemes, etc. The next sentence prediction objective is removed.
XLNET . It adopts the Transformer-XL architecture, which is trained with the Auto-Regressive (AR) objective.
The above four BERT-based models have two variants; namely, the 12-layer base model and the 24-layer large model. We choose their base models in the experiments, which are pre-trained on their respective language modeling tasks.
To quantify the evolution of word representations across layers of deep contextualized models, we measure the pair-wise cosine similarity between 1- and $N$-hop neighbors. By the 1-hop neighbor, we refer to the representation in the preceding or the succeeding layer of the current layer. Generally, word $w$ has $(N+1)$ representations of dimension $d$ for an $N$-layer transformer network. The whole representation set for $w$ can be expressed as

$$\{ v_w^0, v_w^1, \cdots, v_w^N \},$$

where $v_w^i \in \mathbb{R}^d$ denotes the representation of word $w$ at the $i$-th layer. The pair-wise cosine similarity between representations of the $i$-th and the $j$-th layers can be computed as

$$\mathrm{CosSim}(i, j) = \frac{\langle v_w^i, v_w^j \rangle}{\| v_w^i \| \, \| v_w^j \|}.$$
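The similarity matrix and its offset diagonals are straightforward to compute; a small NumPy sketch (with our own function names) is:

```python
import numpy as np

def layer_similarity(reps: np.ndarray) -> np.ndarray:
    """reps: (N+1, d) array where row i is the word representation at layer i.
    Returns the (N+1, N+1) matrix of pair-wise cosine similarities."""
    norms = np.linalg.norm(reps, axis=1, keepdims=True)
    unit = reps / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def hop_profile(sim: np.ndarray, hop: int = 1) -> np.ndarray:
    """Values along the offset-`hop` diagonal, i.e. CosSim(i, i + hop)."""
    return np.diag(sim, k=hop)
```

Averaging `layer_similarity` outputs over many word occurrences yields the similarity maps reported in Fig. 1, and `hop_profile` with `hop=1` gives the 1-hop curves.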
To obtain statistical results, we extract word representations from all sentences in the popular STS-Benchmark dataset. The dataset contains sentence pairs from three categories: captions, news and forums. Since the similarity is measured for each word in isolation, the similarity map is non-contextualized. We average the similarity maps over all words to obtain the overall pattern for each contextualized word embedding model.
Figs. 1 (a)-(d) show the similarity matrices across layers for the four models. Figs. 1 (e)-(h) show the patterns along the offset diagonal. In general, representations from nearby layers share a large similarity value, except for those in the last layer. Furthermore, we observe that, except for the main diagonal, the offset diagonals do not have a uniform pattern, as indicated by the blue arrow in the associated figure. For BERT, SBERT and RoBERTa, the patterns at intermediate layers are flatter, as shown in Figs. 1 (e)-(g). The representations between consecutive layers have a cosine similarity larger than 0.9. The rapid changes mainly come from the first and the last several layers of the network. This explains why the middle layers are more transferable to other tasks, as observed in previous work: since the representations in the middle layers are more stable, more generalizable linguistic properties are learned there. Compared with BERT, SBERT and RoBERTa, XLNET has a very different evolving pattern of word representations. Its cosine similarity curve, shown in Fig. 1 (h), is not concave. This can be explained by the fact that XLNET deviates from BERT significantly, from architecture selection to training objectives. It also sheds light on why a sentence embedding model with XLNET as the backbone performs worse than its BERT-based counterpart, even though XLNET is more powerful in other NLP tasks.
We see from Figs. 1 (e)-(g) that the word representation evolving patterns in the lower and middle layers of BERT, SBERT and RoBERTa are quite similar. Their differences mainly lie in the last several layers. SBERT has the largest drop while RoBERTa has the smallest change in cosine similarity in the last several layers. SBERT places the strongest emphasis on the sentence-pair objective since it uses a Siamese network for sentence-pair prediction. BERT puts some focus on the sentence-level objective via next sentence prediction. In contrast, RoBERTa removes next sentence prediction from training completely.
We argue that the faster changes in the last several layers are related to training with a sentence-level objective, where distinct sentence-level information is reflected. Generally speaking, if a word introduces more information, we should pay special attention to its representation. To quantify such a property, we propose two metrics (namely, alignment and novelty) in Sec. IV-A.
We have so far studied the evolving pattern of word representations across layers. We may ask whether such a pattern is word dependent. This question is answered below. As shown in Fig. 1, the offset diagonal patterns are quite similar to each other on average. Without loss of generality, we conduct experiments on the offset-1 diagonal, which contains 12 values as indicated by the arrow in Fig. 1. We compute the variance of these 12 values to gauge the variability of the 1-hop cosine similarity values with respect to different words. The variance is computed for each word in BERT and SBERT. (Since RoBERTa and XLNET use a special tokenizer whose tokens cannot be linked to real word pieces, we do not test them here.) To avoid randomness, we only report words that appear more than 50 times, in Table I. The same set of words is reported for the BERT and SBERT models. The words are split into three categories based on their variance values, and the insignificant words in a sentence are underlined. We can clearly see from the table that words in the low-variance group are in general less informative. In contrast, words in the high-variance group are mostly nouns and verbs, which usually carry richer content. We conclude that more informative words in deep contextualized models vary more across layers while insignificant words vary less. This finding motivates us to design a module that can distinguish important words in a sentence in Sec. IV-B.
We propose a new sentence embedding method called SBERT-WK in this section. The block diagram of the SBERT-WK method is shown in Fig. 2. It consists of the following two steps:
Determine a unified word representation for each word in a sentence by integrating its representations across layers, weighted according to their alignment and novelty properties.
Conduct a weighted average of unified word representations based on the word importance measure to yield the ultimate sentence embedding vector.
They are elaborated in the following two subsections, respectively.
As discussed in Sec. III, the word representation evolves across layers. We use $v_w^i$ to denote the representation of word $w$ at the $i$-th layer. To determine the unified word representation $\hat{v}_w$ of word $w$ in Step 1, we assign weight $\alpha(v_w^i)$ to its $i$-th layer representation $v_w^i$ and take a weighted average:

$$\hat{v}_w = \sum_{i=0}^{N} \alpha(v_w^i) \, v_w^i,$$

where weight $\alpha(v_w^i)$ can be derived based on two properties: the inverse alignment and the novelty.
We define the context matrix of $v_w^i$ as

$$C = \left[ v_w^{i-m}, \cdots, v_w^{i-1}, v_w^{i+1}, \cdots, v_w^{i+m} \right] \in \mathbb{R}^{d \times 2m},$$

where $d$ is the word embedding dimension and $2m$ is the context window size. We can compute the pair-wise cosine similarity between $v_w^i$ and all elements in the context window and use their average to measure how well $v_w^i$ aligns with the word vectors inside its context window. The alignment similarity score of $v_w^i$ is then defined as

$$\beta_a(v_w^i) = \frac{1}{2m} \sum_{v \in C} \frac{\langle v_w^i, v \rangle}{\| v_w^i \| \, \| v \|}.$$
If a word representation at a layer aligns well with its context word vectors, it does not provide much additional information. Since it is less informative, we can give it a smaller weight. Thus, we use the inverse of the alignment similarity score as the weight for word $w$ at the $i$-th layer. Mathematically, we have

$$\alpha_a(v_w^i) = \frac{K_a}{\beta_a(v_w^i)},$$

where $K_a$ is a normalization constant independent of $i$, chosen to normalize the sum of weights:

$$\sum_{i=0}^{N} \alpha_a(v_w^i) = 1.$$

We call $\alpha_a(v_w^i)$ the inverse alignment weight.
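A minimal NumPy sketch of the inverse alignment weight follows. It assumes the context window of layer $i$ consists of its up-to-$m$ neighboring layers on each side; the boundary handling at the first and last layers, the default `m`, and the function name are our simplifications:

```python
import numpy as np

def inverse_alignment_weights(layer_reps: np.ndarray, m: int = 2) -> np.ndarray:
    """layer_reps: (L, d) representations of one word across L layers.
    The weight of each layer is inversely proportional to the average
    cosine similarity with the layers in a +/- m window (its context)."""
    L = layer_reps.shape[0]
    unit = layer_reps / np.linalg.norm(layer_reps, axis=1, keepdims=True)
    inv_scores = np.empty(L)
    for i in range(L):
        ctx = [j for j in range(max(0, i - m), min(L, i + m + 1)) if j != i]
        beta = float(np.mean(unit[ctx] @ unit[i]))  # alignment similarity score
        inv_scores[i] = 1.0 / beta
    return inv_scores / inv_scores.sum()            # normalized to sum to one
```

Layers whose representation is nearly identical to its neighbors receive small weights, while layers that deviate from their context receive large ones.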
Another way to measure the new information of word representation $v_w^i$ is to study the novel information it brings with respect to the subspace spanned by the words in its context window. Clearly, the words in the context matrix form a subspace. We can decompose $v_w^i$ into two components: one contained in the subspace and the other orthogonal to it. We view the orthogonal one as the novel component and use its magnitude as the novelty score. By singular value decomposition (SVD), we can factorize a matrix $M$ of dimension $d \times 2m$ into the form $M = U \Sigma V^{T}$, where $U$ is a $d \times 2m$ matrix with orthogonal columns, $\Sigma$ is a $2m \times 2m$ diagonal matrix with non-negative numbers on the diagonal, and $V$ is a $2m \times 2m$ orthogonal matrix. In our current context, we decompose the context matrix $C$ in Eq. (4) into $C = U \Sigma V^{T}$ to find the orthogonal basis of the context words, which is represented by matrix $U$. Thus, the orthogonal component of $v_w^i$ with respect to $C$ can be computed as

$$r_w^i = v_w^i - U U^{T} v_w^i.$$
The novelty score of $v_w^i$ is computed by

$$\alpha_n(v_w^i) = K_n \| r_w^i \|_2,$$

where $K_n$ is a normalization constant independent of $i$, chosen to normalize the sum of weights:

$$\sum_{i=0}^{N} \alpha_n(v_w^i) = 1.$$

We call $\alpha_n(v_w^i)$ the novelty weight.
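The novelty weight can be sketched analogously. As before, the ±m windowing across layers and the boundary handling are our simplifications; the orthogonal component is obtained by projecting onto the SVD basis of the context matrix and subtracting:

```python
import numpy as np

def novelty_weights(layer_reps: np.ndarray, m: int = 2) -> np.ndarray:
    """Novelty weight per layer: the norm of the component of v_i that is
    orthogonal to the subspace spanned by its context window representations."""
    L = layer_reps.shape[0]
    scores = np.empty(L)
    for i in range(L):
        ctx = [j for j in range(max(0, i - m), min(L, i + m + 1)) if j != i]
        C = layer_reps[ctx].T                         # (d, |ctx|) context matrix
        U = np.linalg.svd(C, full_matrices=False)[0]  # orthonormal basis of span(C)
        v = layer_reps[i]
        r = v - U @ (U.T @ v)                         # orthogonal (novel) component
        scores[i] = np.linalg.norm(r)
    return scores / scores.sum()
```

A layer already expressible as a combination of its context layers gets near-zero novelty, while a layer pointing in a genuinely new direction gets a large weight.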
We have examined two ways to measure the new information brought by the word representation at the $i$-th layer. We may also consider a weighted average of the two in the form

$$\alpha_c(v_w^i) = \omega \, \alpha_a(v_w^i) + (1 - \omega) \, \alpha_n(v_w^i),$$

where $0 \le \omega \le 1$ and $\alpha_c(v_w^i)$ is called the combined weight. We compare the performance of the three cases (namely, the novelty weight $\alpha_n$, the inverse alignment weight $\alpha_a$ and the combined weight $\alpha_c$) in the experiments. The unified word representation is computed as a weighted sum of the representations in different layers:

$$\hat{v}_w = \sum_{i=0}^{N} \alpha_c(v_w^i) \, v_w^i.$$

We can view $\hat{v}_w$ as the new contextualized word representation for word $w$.
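Given the two per-layer weight vectors (each normalized to sum to one, as above), forming the unified word vector is a single weighted sum; a minimal sketch, with our own naming:

```python
import numpy as np

def unified_representation(layer_reps, alpha_align, alpha_novel, w=0.5):
    """Combine the inverse alignment and novelty weights with mixing
    coefficient w, then average the layer representations accordingly."""
    alpha = w * np.asarray(alpha_align) + (1 - w) * np.asarray(alpha_novel)
    return alpha @ np.asarray(layer_reps)   # (d,) unified word vector
```

With `w=1` this reduces to the inverse-alignment-only scheme and with `w=0` to the novelty-only scheme, matching the three cases compared in the experiments.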
As discussed in Sec. III, the variance of the pair-wise cosine similarity matrix can be used to categorize words into different groups. Words with richer information usually have larger variances. Following this line of thought, we can use the same variance to determine the importance of a word and merge the words of a sentence to determine the sentence embedding vector. This is summarized below.
For the $t$-th word in a sentence, denoted by $w_t$, we first compute its cosine similarity matrix using its word representations from all layers, as shown in Eq. (2). Next, we extract the offset-1 diagonal of the cosine similarity matrix and compute the variance of the offset-1 diagonal values, denoted by $\sigma_t^2$ for the $t$-th word. Then, the final sentence embedding $v_s$ can be expressed as

$$v_s = \sum_{t} \omega_t \, \hat{v}_{w_t},$$

where $\hat{v}_{w_t}$ is the new contextualized word representation for word $w_t$ as defined in Eq. (10) and

$$\omega_t = \frac{| \sigma_t^2 |}{\sum_{k} | \sigma_k^2 |}.$$

Note that the weight for each word is the $\ell_1$-normalized variance as shown in Eq. (12). To sum up, in our sentence embedding scheme, words that evolve faster across layers get higher weights since they have larger variances.
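Putting the pieces together, below is a simplified end-to-end sketch of Step 2. For clarity it uses a plain mean across layers as a stand-in for the unified representation of Eq. (10); only the variance-based word weighting is shown in full:

```python
import numpy as np

def sentence_embedding(word_layer_reps):
    """word_layer_reps: list of (L, d) arrays, one per word in the sentence.
    Word importance = variance of the offset-1 diagonal of its layer-wise
    cosine similarity matrix; the sentence vector is the weighted sum of
    per-word vectors (here, a simple mean across layers)."""
    unified, variances = [], []
    for reps in word_layer_reps:
        unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
        sims = np.diag(unit @ unit.T, k=1)       # CosSim(i, i+1) for all i
        variances.append(np.var(sims))
        unified.append(reps.mean(axis=0))        # stand-in for Eq. (10)
    w = np.asarray(variances)
    w = w / w.sum()                              # l1-normalized variances
    return w @ np.stack(unified)
```

A word whose representation barely moves across layers (variance near zero) contributes almost nothing, matching the observation that such words are the less informative ones.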
The main computational burden of SBERT-WK comes from the SVD, which allows a fine-grained analysis in the novelty measure. The context window matrix $C$ is decomposed into the product of three matrices, $C = U \Sigma V^{T}$, and the orthogonal basis is given by matrix $U$. The context window matrix is of size $d \times 2m$, where $d$ is the word embedding size and $2m$ is the window size. In our case, $d$ is much larger than $2m$, so the computational complexity of the SVD is $O(8m^2 d)$, where lower-order terms are ignored.
Instead of performing the SVD, we use QR factorization in our experiments as an alternative because of its computational efficiency. With QR factorization, we first concatenate the center word representation $v_w^i$ to the context window matrix $C$ to form a new matrix

$$\tilde{C} = \left[ v_w^{i-m}, \cdots, v_w^{i-1}, v_w^{i+1}, \cdots, v_w^{i+m}, v_w^{i} \right],$$

which contains $2m + 1$ word representations. We perform QR factorization on $\tilde{C}$ and obtain $\tilde{C} = Q R$, where the non-zero columns of matrix $Q$ are an orthonormal basis and $R$ is an upper triangular matrix that contains the weights of the word representations under the basis $Q$. We denote the $j$-th columns of $Q$ and $R$ by $q_j$ and $r_j$, respectively. With QR factorization, $r_{2m+1}$ is the representation of $v_w^i$ under the orthogonal basis formed by matrix $Q$. The new direction introduced to the context by $v_w^i$ is represented by $q_{2m+1}$, and the last component of $r_{2m+1}$, denoted by $r_{-1}$, is the weight for this new direction. Then, the novelty weight can be derived as

$$\alpha_n(v_w^i) = K_n | r_{-1} |,$$

where $K_n$ is the normalization constant. The inverse alignment weight can also be computed under the new basis $Q$.
The complexity of the QR factorization is $O(4m^2 d)$, which is about two times faster than the SVD. In practice, we see little performance difference between the two methods. Their runtimes are compared in Sec. V-E.
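The QR-based novelty computation can be sketched as follows (our naming; NumPy's `qr` with `mode='r'` returns only the triangular factor). The last diagonal entry of $R$ is exactly the magnitude of the center vector's component outside the context subspace:

```python
import numpy as np

def novelty_score_qr(context: np.ndarray, v: np.ndarray) -> float:
    """context: (d, k) matrix of context representations; v: (d,) center vector.
    Appending v as the last column and factorizing C' = Q R, the magnitude of
    the last entry of R's last column is v's weight along the one new direction
    not spanned by the context, i.e. its (unnormalized) novelty score."""
    C_new = np.column_stack([context, v])
    R = np.linalg.qr(C_new, mode='r')   # upper-triangular factor only
    return abs(R[-1, -1])
```

Dividing each layer's score by the sum over layers reproduces the normalized novelty weight; the result agrees with the SVD route since both measure the norm of the same orthogonal component.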
Since our goal is to obtain a general purpose sentence embedding method, we evaluate SBERT-WK on three kinds of evaluation tasks.
Semantic textual similarity tasks.
They predict the similarity between two given sentences. They can be used to indicate the embedding ability of a method in terms of clustering and information retrieval via semantic search.
Supervised downstream tasks.
They measure embedding’s transfer capability to downstream tasks including entailment and sentiment classification.
Probing tasks. They have been proposed in recent years to measure the linguistic features of an embedding model and provide fine-grained analysis.
These three kinds of evaluation tasks provide a comprehensive test of our proposed model. The popular SentEval toolkit is used in all experiments. The proposed SBERT-WK method can be built on top of several state-of-the-art pre-trained language models such as BERT, SBERT, RoBERTa and XLNET. Here, we evaluate it on top of two models: SBERT and XLNET. The latter is the pre-trained XLNET model obtained from the Huggingface Transformer repository (version 2.2.1). We adopt their base models, which contain 12 self-attention layers.
For performance benchmarking, we compare SBERT-WK with the following 10 different methods, including parameterized and non-parameterized models.
Average of GloVe word embeddings;
Average of FastText word embeddings;
Average of the last-layer token representations of BERT;
Use [CLS] embedding from BERT, where [CLS] is used for next sentence prediction in BERT;
SIF model, a non-parameterized model that provides a strong baseline in textual similarity tasks;
p-mean model, which incorporates multiple word embedding models;
InferSent  with both GloVe and FastText versions;
Universal Sentence Encoder, a strong parameterized sentence embedding method using multiple training objectives and the Transformer architecture;
Sentence-BERT, which is a state-of-the-art sentence embedding model by training the Siamese network over BERT.
To evaluate semantic textual similarity, we use the 2012-2016 STS datasets [45, 46, 47, 48, 49]. They contain sentence pairs with labels between 0 and 5 that indicate their semantic relatedness. Some methods learn a complex regression model that maps sentence pairs to their similarity scores. Here, we use the cosine similarity between sentence pairs as the similarity score and report the Pearson correlation coefficient. More details of these datasets can be found in Table II.
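This evaluation protocol is simple to reproduce. A sketch assuming the embeddings for the two sides of each pair are stacked row-wise (function names are ours):

```python
import numpy as np

def cosine_scores(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity for paired sentence embeddings."""
    na = np.linalg.norm(emb_a, axis=1)
    nb = np.linalg.norm(emb_b, axis=1)
    return (emb_a * emb_b).sum(axis=1) / (na * nb)

def pearson(pred: np.ndarray, gold: np.ndarray) -> float:
    """Pearson correlation between predicted scores and human labels."""
    return float(np.corrcoef(pred, gold)[0, 1])
```

Since Pearson correlation is invariant to affine rescaling, the raw cosine scores can be compared directly against the 0-5 human labels without any calibration step.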
We also include the SICK-R and STS-Benchmark semantic relatedness datasets in our experiments. Being different from STS12-STS16, the semantic relatedness datasets are under a supervised setting, where we learn to predict the probability distribution of relatedness scores. The STS Benchmark dataset is a popular dataset for evaluating supervised STS systems. It contains 8,628 sentences from three categories (captions, news and forums), divided into train (5,749), dev (1,500) and test (1,379) sets. The Pearson correlation coefficient is reported.
In our experiments, we do not include the representations from the first three layers since they are less contextualized, as reported in prior work. Some superficial information is captured by those representations, and they play a subsidiary role in most tasks. We set the context window size to $m = 2$ in all evaluation tasks.
| Dataset | Task | Sentence A | Sentence B | Label |
| STS12-STS16 | STS | "Don't cook for him. He's a grown up." | "Don't worry about him. He's a grown up." | 4.2 |
| STS-B | STS | "Swimmers are racing in a lake." | "Women swimmers are diving in front of the starting platform." | 1.6 |
| SICK-R | STS | "Two people in snowsuits are lying in the snow and making snow angel." | "Two angels are making snow on the lying children" | 2.5 |
| Model | Dim | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
| Avg. GloVe embeddings | 300 | 52.3 | 50.5 | 55.2 | 56.7 | 54.9 | 65.8 | 80.0 | 59.34 |
| Avg. FastText embeddings | 300 | 58.0 | 58.0 | 65.0 | 68.0 | 64.0 | 70.0 | 82.0 | 66.43 |
| Universal Sentence Encoder | 512 | 61.0 | 64.0 | 71.0 | 74.0 | 74.0 | 78.0 | 86.0 | 72.57 |
| Avg. BERT embeddings | 768 | 46.9 | 52.8 | 57.2 | 63.5 | 64.5 | 65.2 | 80.5 | 61.51 |
The results are given in Table III. We see that using BERT outputs directly generates rather poor performance. For example, the [CLS] token representation gives an average correlation score of only 38.93%. Averaging BERT outputs provides an average correlation score of 61.51%. This is the default setting for generating sentence embeddings from BERT in the bert-as-service toolkit (https://github.com/hanxiao/bert-as-service). Both are worse than non-parameterized models such as averaging FastText word embeddings, a static word embedding scheme. Their poor performance could be attributed to the fact that the model is not trained with a compatible objective function. The masked language model and next sentence prediction objectives are not suitable for a linear integration of representations. A previous study explains how linearity is exploited in static word embeddings (e.g., word2vec), and it sheds light on contextualized word representations as well. Between the above two methods, we recommend averaging BERT outputs because it captures more of the inherent structure of the sentence, while the [CLS] token representation is more suitable for some downstream classification tasks, as shown in Table V.
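The two BERT baselines in Table III reduce to very simple pooling rules over the last-layer token vectors. A sketch with NumPy, assuming the token vectors and attention mask have already been computed by a BERT model:

```python
import numpy as np

def avg_bert_embedding(last_hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Sentence embedding by averaging last-layer token vectors (the
    bert-as-service default). last_hidden: (seq_len, dim); attention_mask:
    (seq_len,) with 1 for real tokens and 0 for padding."""
    mask = attention_mask[:, None].astype(float)
    return (last_hidden * mask).sum(axis=0) / mask.sum()

def cls_embedding(last_hidden: np.ndarray) -> np.ndarray:
    """Sentence embedding from the [CLS] (first) token vector."""
    return last_hidden[0]
```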
We see from Table III that InferSent, USE and SBERT provide the state-of-the-art performance on textual similarity tasks. In particular, InferSent and SBERT have a mechanism to incorporate the joint representation of two sentences, such as the point-wise difference or the cosine similarity. The training process then learns the relationship between sentence representations in a linear manner and computes the correlation using the cosine similarity, which is a perfect fit. Since the original BERT model is not trained in this manner, using the BERT representation directly gives rather poor performance. Similarly, the XLNET model does not provide satisfactory results in STS tasks.
As compared with other methods, SBERT-WK improves the performance on textual similarity tasks by a significant margin. It is worthwhile to emphasize that we use only 768-dimensional vectors for sentence embedding, while InferSent uses 4096-dimensional vectors. As explained in [30, 52, 14], an increase in the embedding dimension leads to increased performance for almost all models. This may explain why SBERT-WK is slightly inferior to InferSent on the SICK-R dataset. For all other tasks, SBERT-WK achieves substantially better performance even with a smaller embedding size.
For supervised tasks, we compare SBERT-WK with other sentence embedding methods in the following eight downstream tasks.
MR: Binary sentiment prediction on movie reviews.
CR: Binary sentiment prediction on customer product reviews.
SUBJ: Binary subjectivity prediction on movie reviews and plot summaries.
MPQA: Phrase-level opinion polarity classification.
SST2: Stanford Sentiment Treebank with binary labels.
TREC: Question type classification with 6 classes.
MRPC: Microsoft Research Paraphrase Corpus for paraphrase prediction.
SICK-E: Natural language inference dataset.
More details on these datasets are given in Table IV.
The design of our sentence embedding model targets transfer capability to downstream tasks. Typically, one can tailor a pre-trained language model to downstream tasks through task-specific fine-tuning. It was shown in previous work that subspace analysis methods are more powerful in semantic similarity tasks. However, we would like to show that sentence embedding provides an efficient solution for downstream tasks as well. In particular, we demonstrate that SBERT-WK does not hurt the performance of pre-trained language models. Actually, it can even outperform the original model in downstream tasks under the SBERT and XLNET backbone settings.
| Dataset | Size | Task | #Classes | Example | Label |
|---|---|---|---|---|---|
| MR | 11k | movie review | 2 | ”A fascinating and fun film.” | pos |
| CR | 4k | product review | 2 | ”No way to contact their customer service” | neg |
| SUBJ | 10k | subjectivity/objectivity | 2 | ”She’s an artist, but hasn’t picked up a brush in a year.” | objective |
| MPQA | 11k | opinion polarity | 2 | ”strong sense of justice” | pos |
| SST2 | 70k | sentiment | 2 | ”At first, the sight of a blind man directing a film is hilarious, but as the film goes on, the joke wears thin.” | neg |
| TREC | 6k | question-type | 6 | ”What is the average speed of the horses at the Kentucky Derby?” | NUM:speed |
| MRPC | 5.7k | paraphrase detection | 2 | ”The author is one of several defense experts expected to testify.” / ”Spitz is expected to testify later for the defense.” | Paraphrase |
| SICK-E | 10k | entailment | 3 | ”There is no woman using an eye pencil and applying eye liner to her eyelid.” / ”A woman is applying cosmetics to her eyelid.” | Contradiction |
For SBERT-WK, we use the same setting as in the semantic similarity tasks. For downstream tasks, we adopt a multi-layer perceptron (MLP) model that contains one hidden layer of 50 neurons. The batch size is set to 64 and the Adam optimizer is adopted in training. All experiments are trained for 4 epochs. For MR, CR, SUBJ, MPQA and MRPC, we use nested 10-fold cross validation. For SST2, we use standard validation. For TREC and SICK-E, we use cross validation.
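As a concrete illustration of this protocol, the classifier can be instantiated with scikit-learn (our choice of library, not necessarily the authors'; for stochastic solvers, scikit-learn's `max_iter` counts epochs):

```python
from sklearn.neural_network import MLPClassifier

def downstream_classifier(seed: int = 0) -> MLPClassifier:
    """MLP used on top of frozen sentence embeddings: one hidden layer of
    50 neurons, the Adam optimizer, batch size 64, trained for 4 epochs."""
    return MLPClassifier(hidden_layer_sizes=(50,), solver="adam",
                         batch_size=64, max_iter=4, random_state=seed)
```

The classifier is then fit on the sentence embeddings of the training split and scored on the evaluation split (with nested 10-fold cross validation where noted above).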
| Model | Dim | MR | CR | SUBJ | MPQA | SST2 | TREC | MRPC | SICK-E | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Avg. GloVe embeddings | 300 | 77.9 | 79.0 | 91.4 | 87.8 | 81.4 | 83.4 | 73.2 | 79.2 | 81.66 |
| Avg. FastText embeddings | 300 | 78.3 | 80.5 | 92.4 | 87.9 | 84.1 | 84.6 | 74.6 | 79.5 | 82.74 |
| Universal Sentence Encoder | 512 | 80.2 | 86.0 | 93.7 | 87.0 | 86.1 | 93.8 | 72.3 | 83.3 | 85.30 |
| BERT [CLS] vector | 768 | 82.3 | 86.9 | 95.4 | 88.3 | 86.9 | 93.8 | 72.1 | 73.8 | 84.94 |
| Avg. BERT embedding | 768 | 81.7 | 86.8 | 95.3 | 87.8 | 86.7 | 91.6 | 72.5 | 78.2 | 85.08 |
| Avg. XLNET embedding | 768 | 81.6 | 85.7 | 93.7 | 87.1 | 85.1 | 88.6 | 66.5 | 56.7 | 80.63 |
The experimental results on the eight supervised downstream tasks are given in Table V. Although it is desirable to fine-tune deep models for downstream tasks, we see that SBERT-WK still achieves good performance without any fine-tuning. As compared with the other 12 benchmarking methods, SBERT-WK has the best performance in 5 out of the 8 tasks. For the remaining 3 tasks, it still ranks among the top three. SBERT-WK with SBERT as the backbone achieves the best averaged performance (87.90%). The average accuracy improvements over XLNET and SBERT alone are 4.66% and 1.91%, respectively. For TREC, SBERT-WK is inferior to the two best models, USE and BERT [CLS], by 0.6%. For comparison, the baseline SBERT is much worse than USE, and SBERT-WK outperforms SBERT by 6.8%. USE is particularly suitable for TREC since it is pre-trained on question answering data, which is highly related to the question type classification task. In contrast, SBERT-WK is not trained or fine-tuned on similar tasks. For SICK-E, SBERT-WK is inferior to two InferSent-based methods by 1.2%, which could be attributed to the much larger embedding dimension of InferSent.
We observe that averaging BERT outputs and the [CLS] vector give quite similar performance. Although [CLS] provides poor performance on semantic similarity tasks, it is good at classification tasks because the classification representation is used in its model training. Furthermore, using an MLP as the inference tool allows certain dimensions to have higher importance in the decision process, whereas the cosine similarity adopted in semantic similarity tasks treats all dimensions equally. As a result, averaging BERT outputs and the [CLS] token representation are not suitable for semantic similarity tasks. If we plan to apply the [CLS] representation and/or averaged BERT outputs to semantic textual similarity, clustering and retrieval tasks, we need to learn an additional transformation function with external resources.
It is difficult to infer what kind of information is present in a sentence representation based on downstream tasks alone. Probing tasks focus more on language properties and, therefore, help us understand sentence embedding models. We evaluate SBERT-WK on 10 probing tasks so as to cover a wide range of aspects, from superficial properties to deep semantic meanings. They are divided into three types: 1) surface information, 2) syntactic information and 3) semantic information.
SentLen: Predict the length range of the input sentence with 6 classes.
WC: Predict which word is in the sentence given 1,000 candidates.
TreeDepth: Predict the depth of the parsing tree.
TopConst: Predict the top constituents of the parsing tree within 20 classes.
BShift: Predict whether a bigram has been shifted or not.
Tense: Classify the main clause tense as past or present.
SubjNum: Classify the subject number as singular or plural.
ObjNum: Classify the object number as singular or plural.
SOMO: Predict whether a noun/verb has been replaced by another one with the same part of speech.
CoordInv: Sentences are composed of two coordinate clauses; predict whether their order is inverted or not.
We use the same experimental setting as that used for the supervised tasks. The MLP model has one hidden layer of 50 neurons. The batch size is set to 64, and Adam is used as the optimizer. All tasks are trained for 4 epochs with standard validation. Different from prior work that uses logistic regression for the WC task in the category of surface information, we use the same MLP model to provide a simple yet fair comparison.
| Model | Dim | SentLen | WC | TreeDepth | TopConst | BShift | Tense | SubjNum | ObjNum | SOMO | CoordInv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Avg. GloVe embeddings | 300 | 71.77 | 80.61 | 36.55 | 66.09 | 49.90 | 85.33 | 79.26 | 77.66 | 53.15 | 54.15 |
| Avg. FastText embeddings | 300 | 64.13 | 82.10 | 36.38 | 66.33 | 49.67 | 87.18 | 80.79 | 80.26 | 49.97 | 52.25 |
| Universal Sentence Encoder | 512 | 79.84 | 54.19 | 30.49 | 68.73 | 60.52 | 86.15 | 77.78 | 74.60 | 58.48 | 58.19 |
| BERT [CLS] vector | 768 | 68.05 | 50.15 | 34.65 | 75.93 | 86.41 | 88.81 | 83.36 | 78.56 | 64.87 | 74.32 |
| Avg. BERT embedding | 768 | 84.08 | 61.11 | 40.08 | 73.73 | 88.80 | 88.74 | 85.82 | 82.53 | 66.76 | 72.59 |
| Avg. XLNET embedding | 768 | 67.93 | 42.60 | 35.84 | 73.98 | 72.54 | 87.29 | 85.15 | 78.52 | 59.15 | 67.61 |
The performance is shown in Table VI. We see that SBERT-WK yields better results than SBERT in all tasks. Furthermore, SBERT-WK offers the best performance in four of the ten tasks. As discussed in the literature, there is a tradeoff between shallow and deep linguistic properties in a sentence. That is, lower-layer representations carry more surface information while deeper-layer representations carry more semantic meaning. By merging information from various layers, SBERT-WK can take care of these different aspects.
The correlation between probing tasks and downstream tasks has been studied in prior work, which found that most downstream tasks correlate with only a subset of the probing tasks. WC is positively correlated with all downstream tasks, which indicates that the word content (WC) in a sentence is the most important factor among all linguistic properties. However, in our finding, the model with the best WC performance is not the best one in downstream tasks. Based on the above discussion, we conclude that good performance in WC alone does not guarantee satisfactory sentence embedding, and we should pay attention to high-level semantic meaning as well. Otherwise, averaging one-hot word embeddings would give perfect performance, which is however not true.
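The last remark can be made concrete: the average of one-hot word vectors is a scaled bag-of-words, so word content is recoverable exactly even though such an embedding carries no semantics. A small sketch:

```python
import numpy as np

def avg_one_hot(word_ids, vocab_size: int) -> np.ndarray:
    """Average of one-hot word vectors, i.e., a normalized bag-of-words."""
    emb = np.zeros(vocab_size)
    for w in word_ids:
        emb[w] += 1.0
    return emb / len(word_ids)

def words_present(sentence_emb: np.ndarray) -> set:
    """WC is trivially solvable for this embedding: a word is in the
    sentence iff its coordinate in the averaged vector is nonzero."""
    return set(np.nonzero(sentence_emb)[0].tolist())
```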
The TREC dataset has been shown to be highly correlated with a wide range of probing tasks. Since SBERT-WK is better than SBERT in all probing tasks, we expect it to yield excellent performance on the TREC dataset. This is verified in Table V: SBERT-WK works well on the TREC dataset with a substantial improvement over the baseline SBERT model.
SBERT is trained using a Siamese network on top of the BERT model. It is interesting to point out that SBERT consistently underperforms BERT in probing tasks. This could be attributed to the fact that SBERT pays more attention to sentence-level information in its training objective: it focuses on sentence-pair similarities. In contrast, the masked language model objective in BERT focuses more on the word or phrase level, and the next sentence prediction objective captures inter-sentence information. Probing tasks test word-level information and the inner structure of a sentence, which are not well captured by the SBERT sentence embedding. Yet, SBERT-WK can enhance SBERT significantly through a detailed analysis of each word representation. As a result, SBERT-WK can obtain similar or even better results than BERT in probing tasks.
To verify the effectiveness of each module in the proposed SBERT-WK model, we conduct an ablation study by adding one module at a time. Also, the effect of two hyperparameters (the context window size and the starting layer selection) is evaluated. The averaged results over the textual semantic similarity datasets, including STS12-STS16 and STSB, are presented.
We present the ablation study results in Table VII. It shows that all three components (Alignment, Novelty, Token Importance) improve the performance of the plain SBERT model. Adding the Alignment weight alone or the Novelty weight alone improves performance by 1.86% and 2.49%, respectively. The Token Importance module can be applied to the word representation of the last layer or to the word representation obtained by averaging all layer outputs. The corresponding improvements are 0.55% and 2.2%, respectively. Clearly, all three modules contribute to the performance of SBERT-WK. The ultimate performance gain can reach 3.56%.
| Model | Avg. STS results |
|---|---|
| SBERT + Alignment | 72.51 |
| SBERT + Novelty | 73.14 |
| SBERT + Token Importance (last layer) | 71.20 |
| SBERT + Token Importance (all layers) | 72.85 |
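To make the two weighting modules above more tangible, here is an illustrative sketch of how alignment and novelty could be measured for one word across its cross-layer context window; the residual is obtained with the QR factorization mentioned in Table VIII. This mirrors the spirit of the modules but is not the paper's exact formulation.

```python
import numpy as np

def novelty_weight(window: np.ndarray, v: np.ndarray) -> float:
    """Fraction of v (a word's representation in the current layer) that
    lies outside the subspace spanned by its neighboring-layer
    representations `window` (shape: n_neighbors x dim). Illustrative
    sketch, not the paper's exact formula."""
    q, _ = np.linalg.qr(window.T)       # orthonormal basis of the window span
    residual = v - q @ (q.T @ v)        # component orthogonal to the span
    return float(np.linalg.norm(residual) / np.linalg.norm(v))

def alignment_weight(window: np.ndarray, v: np.ndarray) -> float:
    """Mean cosine similarity between v and its neighboring-layer
    representations; high alignment means v agrees with its window."""
    sims = window @ v / (np.linalg.norm(window, axis=1) * np.linalg.norm(v))
    return float(sims.mean())
```

A word vector that changes sharply across layers gets a high novelty weight and a low alignment weight; a stable one behaves the opposite way.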
We test the sensitivity of SBERT-WK to two hyperparameters on the STS, SICK-E and SST2 datasets. The results are shown in Fig. 3. The window size is chosen to be 1, 2, 3 and 4. There are at most 13 representations for a 12-layer transformer network, so these window sizes already cover a wide range of representations. The performance versus the window size is given in Fig. 3 (a). As mentioned before, since the first several layers carry little contextualized information, it may not be necessary to include their representations. We choose the starting layer to be from 0 to 6 in the sensitivity study. The performance versus the starting layer is given in Fig. 3 (b). We see from both figures that SBERT-WK is robust to different values of these two hyperparameters. Considering both performance and computational efficiency, we keep the window size at its default value. For the starting layer, performance goes up slightly when the representations of the first three layers are excluded; this is especially true for the SST2 dataset. Therefore, we exclude the first three layers by default. These two default settings are used throughout all reported experiments in other subsections.
We evaluate the inference speed on the STSB dataset. For a fair comparison, the batch size is set to 1. All benchmarking methods are run on both CPU and GPU (an Intel i7-5930K at 3.50 GHz and an Nvidia GeForce GTX TITAN X, respectively), and both results are reported. For SBERT-WK, we report CPU results only. All results are given in Table VIII. On CPU, the total inference time of SBERT-WK (QR) is 8.59 ms (overhead) plus 168.67 ms (SBERT baseline). As compared with the baseline SBERT model, the overhead is about 5%. SVD computation is slightly slower than QR factorization.
| Model | CPU (ms) | GPU (ms) |
|---|---|---|
| Overhead of SBERT-WK (SVD) | 10.60 | - |
| Overhead of SBERT-WK (QR) | 8.59 | - |
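The overhead numbers can be reproduced in spirit with a small benchmark: one thin-matrix factorization per token over its cross-layer matrix. The sizes below are illustrative, not the exact experimental configuration.

```python
import time
import numpy as np

def time_factorizations(n_tokens: int = 16, dim: int = 768, layers: int = 10,
                        repeats: int = 50):
    """Rough CPU timing of the per-sentence overhead: one factorization
    per token over its (layers x dim) cross-layer matrix. Returns mean
    (qr_ms, svd_ms) per sentence."""
    rng = np.random.default_rng(0)
    mats = [rng.standard_normal((layers, dim)) for _ in range(n_tokens)]

    t0 = time.perf_counter()
    for _ in range(repeats):
        for m in mats:
            np.linalg.qr(m.T)                       # reduced QR of a thin matrix
    t_qr = (time.perf_counter() - t0) / repeats

    t0 = time.perf_counter()
    for _ in range(repeats):
        for m in mats:
            np.linalg.svd(m, full_matrices=False)   # thin SVD for comparison
    t_svd = (time.perf_counter() - t0) / repeats

    return t_qr * 1e3, t_svd * 1e3                  # milliseconds per sentence
```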
In this work, we provided an in-depth study of the evolving pattern of word representations across layers in deep contextualized models. Furthermore, we proposed a novel sentence embedding model, called SBERT-WK, that dissects deep contextualized models and leverages the diverse information learned in different layers for effective sentence representations. SBERT-WK is efficient and demands no further training. Evaluation was conducted on a wide range of tasks to show the effectiveness of SBERT-WK.
Based on this foundation, we may explore several new research topics in the future. Subspace analysis and geometric analysis are widely used in distributional semantics. Post-processing of static word embedding spaces leads to further improvements on downstream tasks [37, 62]. Deep contextualized models have achieved superior performance in recent natural language processing tasks. It could be beneficial to incorporate subspace analysis into deep contextualized models to regularize the training or fine-tuning process, which might yield even better representations. Another topic is to understand deep contextualized neural models through subspace analysis. Although deep contextualized models achieve significant improvements, we still do not understand why these models are so effective. Existing work that attempts to explain BERT and the transformer architecture focuses on experimental evaluation. Theoretical analysis of the subspaces learned by deep contextualized models could be the key to demystifying them.
Computation for the work was supported by the University of Southern California’s Center for High Performance Computing (hpc.usc.edu).
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, September 2017, pp. 670–680.
K. Ethayarajh, “How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6949–6956.
C.-C. J. Kuo, “Understanding convolutional neural networks with a mathematical model,” Journal of Visual Communication and Image Representation, vol. 41, pp. 406–413, 2016.
B. Pang and L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004, p. 271.